5 framework simulation neural nets linden german national research center computer science germany abstract field software neural networks rapidly years importance provide increasing levels design simulation analysis neural networks framework intend show high degrees transparency complex experiments obtained basic design philosophy inspired natural researchers explain computational models experiments performed networks building blocks extended mechanisms integrated facilitate construction analysis complex architectures automatic configuration building blocks experiment multiple runtime introduction recent years work development simulation systems neural networks importance largely future software environments provide increasing levels assistance design simulation analysis neural networks pattern signal processing architectures large improvements order fulfill growing demands research community existence software deal multiple learning paradigms applications linden large experiments paper describe object oriented framework simulation neural networks illustrate flexibility transparency prototype called implemented unix workstations running consists lines code imple classes neural network algorithms pattern handling graphical output utilities philosophy design main objective arbitrary combinations learning pattern processing paradigms supervised unsupervised supervised reinforcement learning application domains pattern recognition vision speech control degree design based observation researchers explain neural information processing systems nips block diagram consists group primitive elements building blocks building block inputs outputs functional relationship connections flow data building blocks scripts related building blocks flow control complex nips constructed library building blocks possibly nips interconnected uniform communication links design features building blocks share list common components build endpoints communication links data weight matrices activation 
vectors links action functions process input update internal state compute outputs performing weight updates propagating activation error vectors command provide uniform user interface building blocks scripts control execution action command functions scripts conditional statements loops control structures symbol table runtime access parameters building block learning rates sizes data ranges internal data structures routines provided administration building blocks description experiment divided functional description building blocks highlevel language connection topology building blocks control flow defined scripts parameters building blocks design highlights user interface user interface text oriented driven implies command user choose command file called easy arbitrary user interface structures primitive batch interface large offline graphical user interface online experiments consequence experiments command user interface user easily switch description files previously saved experiments interactive manipulation loaded complete structure experiment accessible runtime means manipulation parameters includes modification experiment topology experienced user include building blocks experiment observation statistical evaluation connect point communication structure deletion building blocks modifying control scripts complete state experiment current values relevant data saved experiments hierarchies distinguish kinds building blocks terminal nonterminal blocks nonterminal building blocks structure complex experiment hierarchies abstract building blocks substructures experiment hierarchies substructures terminal building blocks provide data structures primitive functions scripts nonterminal blocks compose network algorithms nonterminal building block internal structure abstract sites scripts interface appears terminal building block construction experiment construction equivalent building single nonterminal building block complete experiment structure 
construction building blocks functionality extended approaches terminal building blocks programmed deriving existing classes nonterminal building blocks previously defined building blocks programming terminal building blocks terminal building blocks designed derivation existing classes complete administration structure predefined properties parent classes order properties action functions symbols basic operations provided framework note algorithms linden structures added class framework composing nonterminal building blocks nonterminal building blocks combined designed terminal nonterminal blocks building blocks build multilayer building block contexts define interface building blocks adjacent levels experiment hierarchy flow data inside building block controlled scripts call action functions scripts components abstract building block saved library reuse experiments building block leaving possibilities cope large complicated experiments deriving nonterminal building blocks powerful mechanism organizing complex experiments allowing high degrees flexibility reuse offered concept basic mechanism executes description parent building block description derived block additional added runtime formal parameters block and/or refined multiple general function approximator points complex architecture implemented abstract base building block basic structure input output basic operations propagate input train derivations implement algorithm structure statistical routines visualization pattern handling utilities added basic function approximator parameters generic building blocks building block define formal parameters user figure time instantiation inclusion nonterminal building block nonterminal building blocks generic parameterized types interior building blocks names scripts mechanism multilayer created arbitrary type node weight layers user defines experiment parameters important redundant parameters depend building blocks determined automatically constraint satisfaction 
process mechanism avoid specification redundant information check experiment parameters consistency framework simulation neural nets enables construction generic structures communication links building blocks check data matching types building blocks impose additional constraints data sites constraints formed information base types dimensions sizes ranges data sites primary source information parameters building blocks time instantiation building experiment propagation mechanism iteratively complete missing information order satisfy constraints information determined building block experiment spread experiment topology building block loads patterns file dimensionality patterns automatically building blocks holding weight layers multilayer network considered finding unique solution equations cases occur inconsistency contradiction information sources site insufficient information site cess unique solution proof erroneous design user missed experiment observation graphical output file statistical analysis performed normal building blocks comprise network algorithms features built specialized utility building blocks integrated point experiment topology experiment runs classes building blocks supports rich building blocks experiment construction neural networks building blocks complete node weight layers construct multilayer networks chosen efficient computation building blocks single neurons level abstraction captures flexibility paradigms nips terminal building blocks complete classes neural nets provided efficiency demand mathematical building blocks perform arithmetic trigonometric general mathematical transformations scaling normalization building blocks coding provide functionality encode decode patterns utility building blocks provide access input output files dealt unix processes means simply store structured unstructured patterns make randomly accessible graphical building blocks display kind data matter weight matrices activation error vectors involved 
consequence abstract view combining building blocks functionality uniform data interface special building blocks analysis clustering averaging error analysis plotting statistical evaluations linden finally simulations cartpole incorporated build blocks realworld applications software accessed specialized interface blocks examples illustrative examples experiments found additional complex examples full tion software sketch briefly paradigms applications domains easily natural consequence design figure shows part experiment robot controlled modified kohonen feature potential field path planner building blocks workspace planner form main part experiment workspace simulation controlled robot graphical display feature transform proposed planner robot configurations trained experiment configuration space robot planner stored positions obstacles respect experiment configuration obstacle saved results earlier experiments library feature maps form terminal building blocks details complicated structure views visualize experiment buffers provide start values experiment runs shown generates control inputs workspace simply performing vector subtraction subsequently proposed state vectors robot simulation designed neural network simulator cope demands imposed current lines research implementation offers high degree flexibility experimental setup building blocks combined build complex experiments short development cycles simula framework mechanisms detect errors experiment setup provide parameters generic prototype built main research tool neural network experiments constantly refined future developments provide graphical elegant mechanisms reuse predefined building blocks research issues experiment parts optimize performance software preliminary obtained provide support moment acknowledgments numerous users work valuable discussions hints framework simulation neural nets references neural network programming environment editor advanced neural computers pages amsterdam elsevier 
science publishers northholland rochester connectionist simulator technical report revised computer science dept university rochester brown simulation artificial neural network models software paradigm proceedings international joint conference neural networks pages washington user manual tutorial diego lange development environment simulating hybrid connectionist architectures proceedings eleventh annual conference cognitive science society arbor august linden combining multiple neural network paradigms applications proceedings joint conference neural networks ijcnn baltimore ieee miyata users guide version tool constructing running network network environment development modular neural networks proceedings neural network conference paris pages ieee wilson bower genesis system simulating neural networks david touretzky editor advances neural information processing systems pages morgan kaufmann collected papers ieee conference neural information processing systems natural synthetic denver november recent developments neural network simulator spie conference applications artificial neural networks april linden figure integration terminal building blocks nonterminal building block backpropagation figure robot control hybrid controller
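The building-block design described in the framework paper above (terminal blocks with sites and action functions, nonterminal blocks that compose other blocks through links and scripts) can be sketched roughly as follows. This is only an illustrative sketch: the class names `TerminalBlock` and `NonterminalBlock`, the `"in"`/`"out"` site names, and the list-of-names script are assumptions made for this example and are not taken from the framework itself.

```python
from typing import Callable, Dict, List, Tuple

Vector = List[float]


class TerminalBlock:
    """Primitive block: input/output sites, an action function, a symbol table."""

    def __init__(self, name: str, action: Callable[[Vector], Vector]):
        self.name = name
        self.action = action
        self.sites: Dict[str, Vector] = {"in": [], "out": []}
        self.symbols: Dict[str, float] = {}  # parameters readable at runtime

    def run(self) -> None:
        # Process the input site and store the result on the output site.
        self.sites["out"] = self.action(self.sites["in"])


class NonterminalBlock:
    """Composite block: child blocks, links between them, and a script."""

    def __init__(self, name: str):
        self.name = name
        self.children: Dict[str, object] = {}
        self.links: List[Tuple[str, str]] = []  # (source child, target child)
        self.script: List[str] = []  # order in which children are run
        self.sites: Dict[str, Vector] = {"in": [], "out": []}

    def add(self, block) -> None:
        self.children[block.name] = block
        self.script.append(block.name)

    def connect(self, source: str, target: str) -> None:
        self.links.append((source, target))

    def run(self) -> None:
        if not self.script:
            # An empty composite simply passes its input through.
            self.sites["out"] = list(self.sites["in"])
            return
        # The composite's input site feeds the first child in the script.
        self.children[self.script[0]].sites["in"] = list(self.sites["in"])
        for name in self.script:
            child = self.children[name]
            child.run()
            # Copy the child's output along each outgoing link.
            for source, target in self.links:
                if source == name:
                    self.children[target].sites["in"] = list(child.sites["out"])
        # The last child's output site becomes the composite's output.
        self.sites["out"] = list(self.children[self.script[-1]].sites["out"])


def build_demo() -> NonterminalBlock:
    """Two terminal blocks chained inside one nonterminal block."""
    scale = TerminalBlock("scale", lambda xs: [2.0 * x for x in xs])
    shift = TerminalBlock("shift", lambda xs: [x + 1.0 for x in xs])
    net = NonterminalBlock("net")
    net.add(scale)
    net.add(shift)
    net.connect("scale", "shift")
    return net
```

Because the composite exposes the same `sites` and `run` interface as a primitive block, a `NonterminalBlock` can itself be added to another `NonterminalBlock`, which mirrors the idea that a nonterminal block appears to its surroundings like a terminal one.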
12 evolving learnable languages bradley dept comp elec engineering university queensland queensland australia alan blair department computer science university australia wiles dept comp elec engineering school psychology university queensland queensland australia abstract recent theories suggest language acquisition evolution languages forms easily learnable paper evolve combinatorial languages learned recurrent neural network quickly examples additionally evolve languages generalization worlds generalization specific examples find languages evolved facilitate forms impressive generalization biased general purpose learner results provide empirical support theory language language environment learner plays substantial role learning language acquisition language acquisition device introduction factors language learnability exploring issues language learnability special abilities humans learn complex languages emphasized dominant theory based domainspecific learning mechanisms specifically tuned learning languages argued strong constraints learning mechanism complex syntax language learned sparse data child observes recent theories challenge claim interaction learner environment addition theories proposal infants languages adapt human learners survive languages date empirical studies explored adaptation language facilitates learning elman demonstrated classes past forms evolve simulated generations response frequency neural networks showed symbolic system compositional languages emerge learning constrained limited examples evolved recurrent networks communicate simple structured concepts argument humans general purpose learners current research questions require exploring nature extent biases learners bring language learning ways languages exploit biases previous theories suggesting aspects language strong biases gradually breaking aspects language shown learnable weaker biases studies include investigation languages exploit biases subtle 
attention memory limitations children complementary study shown general purpose learners evolve biases form initial starting weights facilitate learning family recursive languages paper present empirical paradigm continuing exploration factors contribute language learnability paradigm propose evolution languages comprising recursive sentences symbolic strings languages sentences conveyed combinatorial composition symbols drawn finite alphabet paradigm based specific natural language simplest task find illustrate point languages compositional structure evolved learnable sentences simplicity communication task analyze language highlight nature generalization properties start evolution recursive language learned easily sentences biased learner address issues robust learning evolved languages showing languages support generalization ways address factor regard paid languages evolve learners easily specific concepts learning paradigms sample randomly training domain human languages learnable random sentences easily examples child exposed environment series simulations test language adapt learnable core concepts paradigm exploring language learnability simple language task recurrent neural networks communicate concept represented point unit interval symbolic channel encoder network sends sequence symbols thresholded outputs concept decoder network receives processes back concept framework greater detail communication successful decoders output approximate encoders input concepts architecture encoder recurrent network input unit output units recurrent connections output hidden units back hidden units encoder produces sequence symbols states output units symbol concept encode network figure hierarchical decomposition language produced encoder symbols produced appearing root tree ordering leaves tree represent input space smaller inputs encoded sentences left examples train decoder found evolution decoder generalize branches order learn task decoder generalize 
systematically states tree including generalizing symbols positions sequence figure shows sequence states successful decoder presented sequence inputs step output units network assume states output greater denoted saturation highest activations remainder denoted output produced propagation propagation continues steps output units assume state decoder recurrent network input units single output recurrent hidden layer work shown conflicting constraints encoder decoder easier decoder process strings reverse order produced encoder input decoder reverse output decoder remains symbol clarity strings written order produced encoder input pattern presented decoder matches output encoder units active network trained backpropagation time produce desired presentation final symbol sequence simple hillclimbing evolutionary strategy twostage evaluation function evolve initially random encoder produces language random decoder learn easily examples evaluation encoder current addition gaussian noise weights performed criteria network produce greater variety sequences range inputs decoder initially small random weights trained encoders output yield lower sumsquared error entire range inputs encoder paired single decoder initially random weight pair successful process repeated encoders input space continuous impossible examine input range approximated uniformly distributed examples final output language generated encoder found evolving easily learnable language humans learn sparse data series simulations test compositional language evolved learners reliably effectively learn examples training examples expect decoder learn task task hard language restricted sequences discrete symbols describe continuous space note simple linear interpolation symbolic alphabet languages recursive solutions unable learned unbiased learner decoder learner simulations showed performed arguments based learnability theory predict languages evolved hillclimbing algorithm outlined 
generations language random decoders trained conditions evolution examples epochs runs encoders decoders hidden units evolved languages learnable decoders minimum maximum learner effectively learned language points space encoders employed average sentences minimum maximum communicate points training examples decoder sampled randomly decoders faced difficult generalization tasks difficulty task demonstrated language analyzed figures evolved languages contained similar compositional structure language figures inherent biases decoder minimal sufficient learning compositional structure evolving languages generalization series simulations demonstrate find languages biased learner generalize examples simulations languages evolved facilitate specific forms generalization users section considered case decoders required output encoders input setup yields approximation line figure compositional structure evolved languages decoder generalize unseen regions space series simulations relationship structure decoder required generalize association studied altering desired relationship encoders input decoders output sets languages evolved requiring identity section function resembling series steps random heights random step conditions section exception training examples generations completion evolution decoders trained final languages conditions generation represents creation variable encoder subsequent training decoder language reliably learnable random decoders effectively learn epochs index array based magnitude blair figure decoder output symbols message language figure xaxis encoders input yaxis decoders output point sequence points decoder trained shown crosses graph symbol decoder outputs values symbol outputs subsequent symbols string finer output note output constructed monotonically symbol providing closer approximation target function recursively approximating linear target final position sequence structure inherent sequences system generalize parts space note 
generalization based interpolation symbol values compositional structure sine function cubic function results show languages evolved enhance generalization prefer world average languages performed tested world evolved worlds languages evolved identity mapping average learned decoders trained identity task compared random step case languages evolved random step task learned decoders trained random step task trained identity task decoders generally performed poorly cubic function decoder learned sine task evolved languages series simulation show manner decoder generalizes restricted task section languages evolve facilitate generalization decoder ways minimal biases evolving learnable languages generalization core concepts simulations randomly selected concepts train decoders cases pathological distribution points made learning extremely difficult contrast human children learn language based common core concepts series simulations tested selecting training concepts positive affect success evolved language simulations alternative generalization functions section decoders difficulty generalizing sine function encoders evolved specifically sine task systems random decoders successfully learned evolved language specifically chosen points generalization sine function hundred decoders trained resulting language points random networks trained fixed learned compared networks trained random sets language evolves communicate restricted concepts unusual simulation shows surprising result language evolve generalize specific core concepts language case sine function discussion series simulations show compositional language learned strings recurrent network generalization performance included correct decoding branches symbols positions figure series simulations highlight language evolved facilitate forms generalization decoder final simulation demonstrates languages tailored generalize specific examples series simulations modify language environment decoder ways relationship 
utterances meaning type generalization required decoder utterances meanings learner exposed case language environment learner exploit minimal biases present learner taking approach similar giving learner additional bias form initial weights effective purpose simulations investigate strongly external factors assist simplifying learning conclusions understanding language learnability context language training young language learners lies process remote language change rate evolutionary change learning structure appears compared time takes child develop language abilities process crucial understanding child learn language surface appears complex poorly taught paper studied ways languages adapt learners running simulations language evolution process contribute additional components list aspects language learned learners recursive structure learned examples languages evolve facilitate generalization evolve easily learnable common sentences simulations paper enhancement language learnability achieved learners environment adding biases language acquisition device acknowledgements work supported bradley postdoctoral alan blair grant wiles references language mind york elman johnson parisi connectionist perspective development press boston symbolic species language brain company york fitness selective adaptation language editors approaches evolution language cambridge university press cambridge language organism implications evolution acquisition language unpublished manuscript february elman learning morphological change cognition syntax natural selection emerges vocabulary population learners editors evolutionary emergence language function linguistic form cambridge university press cambridge computational simulations emergence grammar editors approaches evolution pages cambridge university press cambridge elman learning development neural networks importance small cognition biases critical periods combining evolution learning acquisition syntax brooks editors 
proceedings fourth artificial life workshop pages press blair wiles neural encoders decoders dont talk backwards newton editors simulated evolution learning volume lecture notes artificial intelligence springer
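The hill-climbing strategy described in the language-evolution paper above (mutate the current encoder by adding Gaussian noise to its weights, keep the mutant only if it evaluates better) can be sketched in a few lines. This is a minimal sketch under stated assumptions: the function name `hill_climb`, the parameter names, and the stand-in loss are hypothetical. In the paper the evaluation involves training a fresh decoder on the encoder's output; here a simple squared-distance loss takes its place so that the example stays self-contained.

```python
import random
from typing import Callable, List, Tuple

Weights = List[float]


def hill_climb(
    initial: Weights,
    loss: Callable[[Weights], float],
    generations: int,
    sigma: float,
    seed: int = 0,
) -> Tuple[Weights, float]:
    """Keep a single champion; replace it only when a noisy copy scores lower."""
    rng = random.Random(seed)
    champion = list(initial)
    champion_loss = loss(champion)
    for _ in range(generations):
        # Mutation: add zero-mean Gaussian noise to every weight.
        mutant = [w + rng.gauss(0.0, sigma) for w in champion]
        mutant_loss = loss(mutant)
        # Selection: the mutant survives only if it is strictly better.
        if mutant_loss < champion_loss:
            champion, champion_loss = mutant, mutant_loss
    return champion, champion_loss


# Stand-in objective (an assumption for illustration, not the paper's
# two-stage evaluation): squared distance to a fixed target vector.
TARGET = [0.5, -0.25, 1.0]


def toy_loss(weights: Weights) -> float:
    return sum((w - t) ** 2 for w, t in zip(weights, TARGET))


if __name__ == "__main__":
    best, best_loss = hill_climb([0.0, 0.0, 0.0], toy_loss, 500, 0.1)
    print(best, best_loss)
```

Because a mutant is accepted only when its loss is strictly lower, the champion's loss can never increase from one generation to the next, which is the property that makes the procedure a hill climber.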
0 stability results neural networks department electrical computer engineering university abstract present paper survey utilize results qualitative theory large scale interconnected dynamical systems order develop qualitative theory hopfield model neural networks approach view networks interconnection single neurons results terms qualitative properties individual neurons terms properties structure neural networks aspects neural networks address include asymptotic stability exponential stability instability equilibrium estimates trajectory bounds estimates domain attraction asymptotically stable equilibrium stability neural networks structural perturbations introduction recent years neural networks attracted considerable attention candidates computational systems types largescale dynamical systems analogy biological structures advantage distributed information processing inherent potential parallel computation design computational systems entails detailed understanding dynamics largescale dynamical systems stability instability properties equilibrium points networks interest extent domains attraction basins attraction trajectory bounds present paper apply survey results qualitative theory large scale interconnected dynamical systems order develop qualitative theory neural networks concentrate popular hopfield model type analysis applied models address problems determine stability properties equilibrium point specific equilibrium point neural network asymptotically stable establish estimate domain attraction initial conditions external inputs establish estimates trajectory bounds give conditions instability equilibrium point investigate stability properties structural perturbations present paper local results detailed treatment local stability results found global results contained arriving results present paper make method analysis advanced specifically view high dimensional neural network work supported grant work supported grant american institute physics 
interconnection individual subsystems neurons interconnected systems viewpoint makes results distinct derived literature results terms qualitative properties free subsystems individual neurons disconnected network terms properties structure neural network results constitute design tools approach makes systematic analysis high dimensional complex systems frequently enables circumvent difficulties encountered analysis systems conventional methods structure paper start defining hopfield model introduce interconnected systems viewpoint present representative stability results including estimates trajectory bounds domains attraction results instability conditions stability structural finally present concluding remarks hopfield model neural networks present paper neural networks hopfield type systems represented equations form usual continuous continuously differentiable strictly increasing denotes capacitance denotes resistance possibly sign inversion inverter denotes amplifier nonlinearity denotes input frequently assumed explicitly stated interested behavior solutions points rest positions setting inputs define equilibrium provided locations determined interconnection pattern neural network parameters nature nonlinearities isolated equilibrium neighborhood equilibrium analyzing stability properties equilibrium point assume loss located origin employed shifts equilibrium point origin leaves structure interconnected systems viewpoint find convenient view system interconnection free systems isolated subsystems equations form viewpoint structure system method analysis advanced establish stability results terms properties free subsystems terms properties structure method makes circumvent difficulties arise analysis complex highdimensional systems results obtained manner frequently yield insight dynamic behavior systems terms system components interconnections general stability conditions demonstrate result exponential stability point principal lyapunov stability results systems 
presented chapter utilize hypotheses result system external inputs system interconnections satisfy estimate real constants exists test matrix negative definite defined position state prove result theorem equilibrium neural network exponentially stable hypotheses satisfied proof choose function function positive definite derivative solutions invoked view time matrix matrix denotes largest eigenvalue real symmetric matrix assumption negative definite neighborhood origin mini maxi equilibrium neural network exponentially stable theorem consistent philosophy viewing neural network free subsystems lyapunov function consisting weighted lyapunov functions free subsystem weighting vector flexibility emphasize relative importance qualitative properties individual subsystems hypothesis measure interaction subsystems emphasized theorem require parameters form symmetric matrix weak coupling conditions test matrix hypothesis offdiagonal terms positive special case offdiagonal terms test matrix nonnegative equivalent stability results obtained easier apply theorem results called conditions literature conditions reflect properties system consequence process proof subsequent result make properties matrices chapter addition assumptions system nonlinearity satisfies sector condition successive principal test matrix positive defined defined theorem equilibrium neural network asymptotically hypotheses true proof proof proceeds lines similar theorem time lyapunov function lyapunov function reflects interconnected nature system note lyapunov function viewed generalized hamming distance state vector origin estimates trajectory bounds general interested questions stability equilibrium system performance assessing properties neural system investigating solution bounds equilibrium interest present result assuming hypotheses theorem satisfied require external inputs make additional assumptions assume exist defined defined assume system constant defined proof theorem make comparison result 
scalar comparison equation form continuous prove auxiliary theorem denote maximal solution comparison equation continuous function satisfies differential inequality long exist proof result comparison theorems theorem adopt notation mini defined denote solution position prove result bounds solution theorem assume hypotheses satisfied provided proof choose lyapunov function solutions obtain test matrix note satisfied present satisfied note manipulations show yields comparison solution obtain comparison result desired estimate true provided estimates domains attraction neural networks type considered equilibrium points equilibrium asymptotically stable exponentially stable extent stability interest usual assume equilibrium interest denotes solution network points true origin points makes domain attraction basin attraction equilibrium general determine domain techniques devised estimate subsets domain attraction apply method neural networks making theorem technique applicable results making modifications assume hypotheses satisfied free subsystem choose lyapunov function satisfied negative definite contained domain attraction equilibrium free subsystem obtain estimate domain attraction neural network lyapunov function easy matter show subset domain attraction neural network order obtain estimate domain attraction present method choose optimal fashion reader referred literature methods accomplish discussed instability results equilibrium points neural network unstable present sample instability theorem viewed counterpart theorem results formulated counterparts stability results type considered obtained making modifications system interconnections satisfy estimates successive principal test matrix positive stable subsystems unstable subsystems position prove result theorem equilibrium neural network hypotheses satisfied addition denotes empty equilibrium completely unstable proof choose lyapunov function solutions proof theorem defined conclude negative definite 
neighborhood origin point equilibrium unstable function negative definite equilibrium completely unstable chapter stability structural perturbations specific applications involving adaptive schemes learning algorithms neural networks interconnection patterns external inputs changed yield evolution sets desired asymptotically stable equilibrium points domains attraction present diagonal dominance conditions hypothesis constraints guarantee desired equilibria desired stability properties specific assume neural network designed interconnections strengths varied values express writing place assume neural network things arranged manner desired true mini previously clear diagonal dominance conditions satisfied equilibrium asymptotically stable important recognize condition constitutes stability condition neural network structural perturbations strengths interconnections neural network manner achieve desired equilibrium points satisfied asymptotically stable stability structural perturbations nicely concluding remarks present paper applied results qualitative theory large scale interconnected dynamical systems order develop qualitative neural networks hopfield type results local information analysis equilibrium criteria exponential stability asymptotic stability instability equilibrium networks devised methods estimating domain attraction asymptotically stable equilibrium estimating trajectory bounds networks showed stability results applicable systems structural perturbations experienced neural networks adaptive learning schemes arriving results viewed neural networks interconnection single neurons results terms qualitative properties free single neurons terms network structure viewpoint suited study hierarchical structures naturally lend implementations vlsi type approach makes circumvent difficulties arise analysis synthesis complex high dimensional systems references review neural networks computing denker editor american institute physics conference proceedings snowbird 
hopfield tank science hopfield proc natl acad hinton anderson editors parallel models associative memory kohonen selforganization associative memory springerverlag miller qualitative analysis large scale dynamical systems academic press miller ordinary differential equations academic press bell system tech ieee trans automatic control submitted publication ieee trans syst press carpenter cohen grossberg science power system stability amsterdam north holland miller circuits systems signal processing stability largescale systems structural singular perturbations
7 financial applications learning hints abumostafa california institute technology abstract basic paradigm learning neural networks learning examples training inputoutput examples teach network target function learning hints eralization learning examples additional information target function incorporated learning process information common sense rules special financial market applications train data noisy hints advantage demonstrate hints trading versus british german mark japanese swiss period months explain general method learning hints applied markets learning model method restricted neural networks introduction neural network learns target function examples training data function sees data financial market applications typical limited amount relevant training data high noise levels data information content data modest learning process make create information poses fundamental limitation abumostafa learning approach neural networks models simple rules moving average elaborate system learning hints abumostafa feature learning examples information content data method prior knowledge target function common sense training data learning process types hints application paper give experimental evidence impact hints learning performance explain method detail enable readers hints markets simple hints result significant improvement learning performance figure shows learning performance foreign exchange trading symmetry hint section closing price history plots percentage returns cumulative daily transaction cost included sliding test window period april november averaged major markets runs error upper left corner standard long based trading days assuming independence runs plots establish statistically significant differential performance hints differential holds average hint average test number figure learning performance hint goal hints information training data differential performance dramatic start informative training data similarly additional hint pronounced effect 
financial applications learning hints hints application saturation performance market reflects future past efficient market hypothesis saturation performance hints make market saturation level enable approach level learning paper organized section characterizes notion noisy data defining performance range argue extra information financial market applications pronounced tern recognition applications section discuss method learning hints give examples types hints explain represent hints learning process section result details hint major markets section experimental evidence information content hint regularization effect results performance differential observe financial data section characterization noisy data applies markets broad treatment neuralnetwork applications financial markets reader referred abumostafa information input market target input neural network figure illustration nature noise financial markets market system takes information news events produces output price movement simplicity model neural network attempts abumostafa simulate market figure takes input small subset information information modeled plays role noise concerned network determine target output based approximates output typical approximation correct slightly half time makes noisy agree time performance range contrast typical pattern recognition application optical character recognition agree time performance range poor performance poses problem range additional difficulty learning range range performance good performance learning distinguish good hypotheses based limited examples problem range number hypotheses good points huge contrast range good performance high number hypotheses good limited confidence hypothesis learned range learned range random trading policy making good weeks random character recognition system read correctly problem large examples large numbers make agree time coincidence financial data problem continuous evolution markets data represent patterns behavior longer 
hold relevant data training purposes limited fairly recent times noise training data information network learn function information needed hints means providing hints section give examples types hints discuss represent learning process describe simple hints reader method minimal effort detailed treatment abumostafa method concerned hint property target function instance symmetry hint markets applies versus german mark figure simple hint asserts pattern price history implies move market implication holds market viewpoint german mark viewpoint formally terms normalized prices hint translates invariance inversion prices symmetry hint valid ultimate test learning perfor mance affected introduction hint formulation hints financial applications learning hints experience common sense analysis market list valid properties market represent hints canonical form shortly proceed incorporate learning process improvement performance good hints german mark figure illustration symmetry hint markets canonical representation hints systematic task step representing hint choose generating virtual examples hint illustration suppose hint asserts target function function input hint form input generate virtual examples needed picking inputs hint represented virtual examples ready incorporated learning process examples target function notice function learned minimizing error measure ultimately enforcing condition virtual hint learned minimizing ultimately enforcing condition involves network minimizing ference outputs easy show backpropagation rumelhart generation virtual hint require knowing target function needed compute error hint fact artificial inputs fact target function crucial limited resource examples target function interested hints place hand hints examples target function employ hint examples instance infer hint representing hint examples easy simple hints software learning examples abumostafa illustrate represent common types hints common type invariance hint hint asserts 
pairs instance formalized pairs shifted versions represent invariance hint invariant pair picked virtual error related type hint monotonicity hint hint asserts pairs instance monotonically nondecreasing formalized pairs application monotonicity hint occurs extension personal credit person identical person makes credit line exceed represent monotonicity hint monotonic pair picked virtual error trading applied symmetry hint markets versus british german mark japanese swiss case closing prices preceding days inputs objective fitness function chose total return training simple filtering methods inputs outputs networks training consisted days test days show improved performance symmetry hint roughly speaking market half time trade takes days rate close hint hint returns include transaction cost spread average notice return objective function resulted fairly good return modest rate cross checks final section report experimental results aimed claim information content hint reason improved performance hint plays role constraint neural network learning restricts solutions network settle overfitting common problem learning examples restriction improve outofsample performance reducing overfitting akaike moody idea regularization informative role role symmetry hint experiments experiment uninformative hint noise hint random target output inputs examples symmetry hint figure contrasts performance noise hint real symmetry hint averaged notice performance noise hint close hint figure consistent notion uninformative hint regularization effect negligible financial applications learning hints noise hint real test number figure performance real hint versus noise hint false hint test number figure performance real hint versus false hint abumostafa experiment hint false hint place symmetry hint hint takes examples symmetry hint asserts figure contrasts performance false hint real symmetry hint false hint effect performance consistent hypothesis symmetry hint valid negation results 
worse performance hint notice transaction cost consideration plots works negative bias losses trading policies conclusion explained learning hints systematic method combining rules data learning process reported experimental results statistically significant improvement performance major markets resulted simple symmetry hint types hints simple ways learning enable readers hints markets acknowledgements acknowledge amir valuable input grateful abumostafa expert remarks references abumostafa learning hints neural networks journal complexity abumostafa method learning hints advances neural information processing systems hanson morgan kaufmann abumostafa proceedings neural networks capital markets pasadena california november fitting autoregressive models prediction inst stat math moody effective number parameters analysis generalization regularization nonlinear learning systems advances neural information processing systems moody morgan kaufmann rumelhart hinton williams learning internal representations error propagation parallel distributed processing rumelhart press weigend rumelhart huberman generalization weight elimination application forecasting advances neural information processing systems lippmann morgan kaufmann
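The hints paper above describes turning an invariance hint and a monotonicity hint into virtual examples, each scored by an error that does not need the target function. The formulas themselves did not survive extraction, so what follows is only a minimal sketch under stated assumptions: the function names, the squared-difference form of the error, and the averaging over pairs are illustrative choices, not the author's stated definitions.

```python
from typing import Callable, Sequence, Tuple

Vector = Sequence[float]
Model = Callable[[Vector], float]
Pair = Tuple[Vector, Vector]


def invariance_hint_error(g: Model, pairs: Sequence[Pair]) -> float:
    """Mean squared disagreement of g on pairs the hint declares equivalent.

    Assumption: each pair (x, x_prime) is a virtual example for which the
    hint asserts f(x) == f(x_prime). No target value is needed.
    """
    return sum((g(x) - g(xp)) ** 2 for x, xp in pairs) / len(pairs)


def monotonicity_hint_error(g: Model, pairs: Sequence[Pair]) -> float:
    """Mean squared violation of g on ordered pairs.

    Assumption: each pair (x, x_prime) is ordered so that the hint asserts
    f(x) <= f(x_prime); only violations of that ordering are penalised.
    """
    total = 0.0
    for x, xp in pairs:
        gap = g(x) - g(xp)
        if gap > 0.0:
            total += gap ** 2
    return total / len(pairs)


if __name__ == "__main__":
    # Hypothetical models, used only to exercise the two error measures.
    symmetric = lambda x: sum(x)   # unchanged when inputs are permuted
    first_only = lambda x: x[0]    # depends on the first input only

    swapped = [((1.0, 2.0), (2.0, 1.0))]
    print(invariance_hint_error(symmetric, swapped))
    print(invariance_hint_error(first_only, swapped))
    print(monotonicity_hint_error(first_only, [((2.0,), (1.0,))]))
```

Either error can be added to the ordinary error on training examples and minimised by the same gradient procedure. How the two are weighted or scheduled is a separate design choice that this sketch leaves open.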
5 learning sequential tasks incrementally adding higher orders mark ring department computer sciences taylor university texas austin austin texas abstract incremental higherorder network combines properties found learning sequential tasks higher order connections incremental introduction units network adds higher orders needed adding units dynamically modify connection weights units weights timestep information previous step temporal tasks learned feedback greatly simplifying training unlimited number units added reach arbitrarily distant past experiments reber gram demonstrated orders magnitude recurrent networks introduction secondorder recurrent networks proven powerful trained complete back propagation time demonstrated fahlman recurrent network incrementally adds nodes recurrent cascadecorrelation algorithm superior recurrent networks incremental higherorder network presented combines advantages approaches network network simplified ring tinuous version introduced adds higher orders needed system solve task adding units dynamically modify connection weights units modify weights timestep information temporal tasks learned feedback general formulation unit network input output highlevel unit unit time input unit output unit target time higherorder unit modifies weight time output highlevel units collectively referred units timestep output highlevel units receive summed input input units gating function representing weight connection ticular timestep higherorder unit connection input unit added connections weight timestep exists timestep values output units calculated input units weights possibly modified activations highlevel units previous timestep values highlevel units calculated time output units generate output network highlevel units simply alter weights timestep unit activations computed simultaneously activations units required connection modified unit identical notational convenience higherorder connection usual sense righthand side equation equation 
replaces equation result fact network increases height higher orders introduced lower orders preserved learning sequential tasks incrementally adding higher orders timestep network arranged hierarchically higherorder units higher hierarchy units side weight affects higherorder units outgoing connections network recurrent impossible highlevel unit affect directly indirectly input hidden units traditional sense units linear activa tion function imply nonlinear functions represented nonlinearities result multiplication higherlevel input units equations learning gradient descent reduce sumsquared error learning rate timesteps weight affect networks output error equation rewritten constant unit specifies high hierarchy unit specifies timesteps takes change unit activation affect networks output space limitations derivation gradient shown resulting weight change rule weights changed error values output units collected highlevel unit higher hierarchy units side weight affects weight made bottom equation calculated time computed intuition learning rule highlevel unit learns utilize context previous timestep adjusting connection influences timestep minimize connections error context information decide correct connection timestep previous timestep information higherorder unit assigned connection needed information previous timestep units built information earlier steps method concentrating unexpected events similar hierarchy decisions history compression schmidhuber ring units unit added weight strongly opposite directions learning forcing weight increase decrease time unit created determine contexts weight direction averages connection records average change made weight longterm absolute deviation parameter specifies duration longterm average lower means average longer period time small large weight strongly conflicting directions unit built build unit small constant denominator threshold number units network related method adding units feedforward networks introduced unit 
added incoming weights initially output weights simply learns anticipate reduce error timestep weight modifies order number units unit created statistics connections destination unit reset results reber grammar small finitestate grammar form transitions node made labeled arcs task network input label traversed predict learning sequential tasks incrementally adding higher orders elman recurrent incremental network rtrl cascade higherorder correlation network sequences hidden units table incremental higherorder network compared recurrent works reber grammar results recurrent networks sources andor performance shown avail rtrl realtime recurrent learning algorithm traversed training sequence string generated starting transition randomly choosing leading current state final state reached inputs outputs encoded locally output units input units bias unit network considered correct highest activated outputs correspond arcs traversed current state note current state determined current input recurrent network learn task string presentations hidden units correctness criteria elman slightly previous recurrent cascadecorrelation learn task hidden units average string presentations incremental higherorder network trained continuous stream input network reset beginning string training considered complete network correctly classified strings criterion network completed training average string presentations standard deviation achieved perfect generalization test sets randomly generated strings runs reber grammar stochastic higherorder units imposed network prevent continually creating units attempt random number generator complete results network reber grammar task table parameter settings bias network perform bias unit network tested variable tasks introduced mozer shown figure tasks intended test performance networks long timedelays sequences alternately presented network sequence begins fixed string characters inserted number timesteps beginning figure number timesteps 
difference sequences begins repeats begins repeats network learn predict item sequence current item input ring timestep sequence sequence figure variable training sequence item sented network timestep target item sequence items sequence order correctly predict network remember sequence began inputs locally encoded order network predict occurrence remember sequence began length increased order create tasks greater difficulty results tasks table values standard recurrent network variation paper incremental higherorder difficulty gaps largest tested string tasks position repeated characters exception network continued scale linearly size terms units epochs required training tasks stochastic network stopped building units created solve task parameter settings bias network considered correctly predicted element sequence strongly activated output unit unit representing correct prediction sequence considered correctly predicted elements initial correctly predicted number training sets required standard mozer incremental units recurrent network higherorder created table comparison tasks standard network devised specifically long timedelays mozer reported results gaps incremental higherorder network column number units created incremental higherorder learning sequential tasks incrementally adding higher orders conclusions incremental higherorder network performed networks compared tiny tests order parameters tasks tasks network large number units contextdependent events inherently stochastic network principle build larger hierarchy searches back time context predict connections weight units needed bridge long finally bridge timedelay created generalize timedelays hand network learns fast simple structure adds highlevel units needed feedback unit produces signal feed back learning back propagation time outputs highlevel units fanin equal number inputs number connections system smaller number connections traditional network number hidden units finally network thought system 
continuousvalued condition action rules inserted removed depending rules turn inserted removed depending rules units added initially invisible system effect gradually learn effect opportunity decrease error presents acknowledgement work supported nasa johnson space center graduate student program training grant eric robert simmons discussions helpful comments paper technologies contribution computer time office space required complete work references jonathan richard modeling control finite state environments thesis department computer information sciences university massachusetts february cleeremans david servanschreiber james mcclelland finite state automata simple recurrent networks neural computation richard hierarchical organisation candidate principle editors growing points pages cambridge cambridge university press elman finding structure time technical report university california diego center research language april ring scott fahlman recurrent cascadecorrelation architecture lippmann moody touretzky editors advances neural information processing systems pages mateo california morgan kaufmann publishers giles miller chen chen extracting learning unknown grammar recurrent neural networks moody hanson lippmann editors advances neural information processing systems pages mateo california morgan kaufmann publishers michael mozer induction multiscale temporal structure john moody steven hanson richard lippmann editors advances neural information processing systems pages mateo morgan kaufmann publishers jordan pollack induction dynamical recognizers machine learning mark ring incremental development complex behaviors construction sensorymotor hierarchies lawrence editors machine learning proceedings eighth workshop pages morgan kaufmann publishers june mark ring sequence learning incremental higherorder neural networks technical report artificial intelligence laboratory university texas austin january robinson fallside utility driven dynamic error propagation
network technical report cambridge university engineering department rumelhart hinton williams learning internal representations error propagation rumelhart mcclelland editors parallel distributed processing explorations microstructure cognition foundations press schmidhuber learning unambiguous reduced sequence descriptions moody hanson lippmann editors advances neural information processing systems pages mateo california morgan kaufmann publishers raymond watrous kuhn induction finitestate languages secondorder recurrent networks moody hanson lippmann editors advances neural information processing systems pages mateo california morgan kaufmann publishers ronald williams david zipser learning algorithm continually running fully recurrent neural networks neural computation mike node splitting constructive algorithm feedforward neural networks neural computing applications
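The incremental higher-order network paper above says that a new unit is built for a connection whose weight is pushed strongly in conflicting directions. It detects this by keeping a long-term average of the weight change and a long-term absolute deviation, and comparing the two against a threshold with a small constant in the denominator. The equations were lost in extraction, so the sketch below is an assumption-laden reading of that description: the exponential moving averages, the ratio test, and every name and default value are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ConnectionStats:
    """Running statistics for one connection (all defaults are hypothetical)."""

    sigma: float = 0.1        # assumed averaging rate; lower means a longer window
    epsilon: float = 0.1      # small constant keeping the denominator away from zero
    threshold: float = 2.0    # assumed threshold for building a unit
    avg_change: float = 0.0   # long-term average of the weight change
    avg_abs_dev: float = 0.0  # long-term mean absolute deviation of the change

    def update(self, delta_w: float) -> None:
        """Fold one weight change into both running averages."""
        self.avg_change = self.sigma * delta_w + (1.0 - self.sigma) * self.avg_change
        deviation = abs(delta_w - self.avg_change)
        self.avg_abs_dev = self.sigma * deviation + (1.0 - self.sigma) * self.avg_abs_dev

    def should_add_unit(self) -> bool:
        """True when the average is small but the deviation is large."""
        ratio = self.avg_abs_dev / (self.epsilon + abs(self.avg_change))
        return ratio > self.threshold

    def reset(self) -> None:
        """Clear the statistics, e.g. after a unit has been created."""
        self.avg_change = 0.0
        self.avg_abs_dev = 0.0


if __name__ == "__main__":
    conflicted = ConnectionStats()
    for step in range(50):
        conflicted.update(1.0 if step % 2 == 0 else -1.0)  # alternating pushes
    print(conflicted.should_add_unit())

    consistent = ConnectionStats()
    for _ in range(50):
        consistent.update(1.0)  # always pushed the same way
    print(consistent.should_add_unit())
```

The ratio form means a weight that moves steadily in one direction never triggers unit creation, while a weight that oscillates with near-zero net change does. That matches the qualitative behaviour the text describes, whatever the exact formula was.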
11 facial memory kernel density estimation cottrell department computer science engineering diego jolla ucsd thomas department psychology university abstract compare ability memory models face stimulus representations account probability human subject responded facial experiment models generalized context model probabilistic sampling model model related kernel density estimation explicitly encodes stim representations positions stimuli face space projections test faces eigenfaces study representation based response grid gabor filter combinations model space predicts observed familiarity inversion effect subjects false alarm rate morphs tween similar faces higher rate studied faces evidence consistent hypothesis human faces kernel density estimation task faces require larger kernels typical faces background studying errors subjects make face recognition memory tasks standing mechanisms representations underlying memory face processing visual perception errors testing subjects recognition faces created studied faces combined recently examined extent morphs unfamiliar faces affect subjects tendency make recognition errors experiments facial images males morphs images facial memory kernel estimation figure normalized morphs database figure stimuli study press subjects rate similarity pairs large faces morphs performed multidimensional scaling similarity ratings derive face space study experiment subjects studied facial images including similar pairs determined study pairs included order study morphs similar faces dissimilar faces false alarms call pair images derived parents child experiments test phase subjects asked make judgments response morphs completely distractor faces targets parents morphs results pairs subjects responded studied parent effect familiarity inversion occurred morphs similar parents similar parents similar child morphs contribute false alarm response researchers proposed models account data explicit memory periments applied types models data 
largely negative results paper limit discussion models generalized context model models rely assumption subjects explicitly store representations stimuli study plied models experiment data models fully account observed similar familiarity inversion similar parents explicitly memory producing prototypes morphs extend submitted work applying exemplar models additional face stimulus representations propose exemplar model accounts similar morphs familiarity inversion results consistent hypothesis facial memory kernel density estimation bishop task distinctive exemplars require larger kernels basis model predict respect study critical factor kernel size opposed contextfree notion easily test prediction empirically experimental methods face stimuli normalization original images digitized grayscale images lighting background fairly consistent position subjects varied extent facial hair automatically located left eyes face simple template correlation technique translated rotated scaled image eyes aligned image scaled image speed image processing figure shows examples normalized morphs original images published cottrell representations positions multidimensional face space researchers scaling approach model phenomena face processing press subjects rate similarity pairs faces test performed multidimensional scaling similarity matrix faces target faces dropped analysis process resulted solution stress modeling results vector stimulus representation principal component projections eigenfaces eigenvectors covariance matrix face images common basis face representations turk pentland performed principal components analysis face images study experiment nonzero eigenvectors covariance matrix projected test images significant eigenfaces obtain vector representing face gabor filter responses malsburg colleagues made effective banks gabor filters orientations spatial frequencies face recognition tems form wavelet buhmann malsburg scales orientations square grid normalized face image 
basis face stimulus representation representation resulted vector face stimulus performed principal components analysis reduce dimensionality keeping representations dimensionality representations obtained vector based gabor filter responses represent test face image models generalized context model simple form lead directly modulated density estimation model version predicted representation test stimulus representations studied exemplars linearly convert summed similarity probability representations study stimuli narrow width similarity function euclidean distance weighted euclidean distance attentional weights constants intuitively model simply places function studied exemplars predicted familiarity test probe simply summed height surfaces location recall representations projection space gabor filter space allowing adaptive weights representation reasonable resulting model parameters points adaptive weights gabor space resulting models fitting parameters points models report results space adaptive weights report adaptive weight results models gabor space submitted proposed attempt poor predictions human data related eigenfaces number theoretical measure correlated measure space facial memory kernel density estimation representations space sampling exemplars idea model subject shown test stimulus summed comparison exemplars memory test probe probabilistically samples single exemplar memory subject responds similarity exemplar noisy criterion model similarity scaling parameter parameters describing noisy threshold function space limitations provide details model human data framework introduced prototypes locations morphs space made probability sampling prototype proportional similarity parents compare basic version blend exemplars mixture model memory model assume subjects study time implicitly create probability density surface training subjects probability responding probe proportional height surface point probe surface robust face variability noise typically 
encountered face recognition lighting perspective provide level discrimination support intervals representations single face overlap noise rational decision boundary constructed assume gaussian mixture model density surface built gaussian blobs centered studied exemplar task form kernel density estimation bishop formulate task predicting human subjects framework optimizing priors widths kernel functions minimize squared error prediction minimize number free parameters model parsimonious methods setting priors kernel function widths potentially lead insights principles underlying human data priors widths held constant simple parameter model predicting probability subject responds test stimulus uniform prior normalization constants stan dard deviation gaussian kernels ignore constants model essentially version results section show model fully account human familiarity data representational spaces improve model introduce parameters prior kernel function height standard deviation kernel function width vary studied exemplar modification intuitive humans asked parent faces similar parent distinctive parent typical subjects tend choose distinctive parent tanaka submitted hypothesize human asked study remember faces recognition test faces neighbors relaxed wider discrimination boundaries faces nearby neighbors representation space studied face computed ical face average distance nearest studied faces allowed height width kernel function vary case results weighted distance space cottrell model space weights projections gabor table rmse models representations quality models adaptive attentional weights reported lowdimensional representation weights baseline rmse achievable constant prediction parameter fitting model evaluation twelve combinations models face representations searched parameter space simple hill climbing parameter settings minimized squared error models predicted actual human data rate models effectiveness criteria measure models global rmse test points models 
rmse compared baseline performance model simply predicts human achieves rmse evaluate extent model predicts human response categories test stimuli parent targets distractors similar parents dissimilar parents similar morphs dissimilar morphs model correctly predicts rank ordering category means accounts similar familiarity inversion pattern human data long models adequate fitting human data measured rmse prefer models predict familiarity inversion effect natural consequence minimizing rmse results table shows global pair model space quantitative generally outperforms indi tight quantitative parameters linear transformation built model important allowing kernel function vary distinctive ness note projection representation consistently outperformed gabor representation space representation purposes degree model predicts human responses categories stimuli important good globally figure takes detailed model predicts human category means space global human familiarity ratings predict familiarity inversion similar morphs mixture model weighted space correctly predicts familiarity effect models human responses similar morphs discussion results mixture model consistent hypothesis facial memory kernel density estimation task distinctive exemplars require larger kernels true density estimation tend outliers sparse areas face space human data show priors kernel function widths outliers increased potentially significant problems work presented experimented models finding predict familiarity inversion effect fitting single facial memory kernel density estimation actual predicted figure average responses faces category dissimilar parents similar morphs targets similar parents dissimilar morphs distractors experiment model carefully tested data predictions empirically theoretical measure based sparseness face space exemplar sufficient account similar morphs familiarity inversion predict respect study critical factor kernel size contextfree human judgments easily test prediction 
subjects rate stimuli prior exposure determine ratings improve degrade models surprising aspect results model requires representation based human similarity judgments ideally prefer provide account representa tions projections gabor filter responses interestingly efficacy representations depend similar repre sentations projection representation performed worst distances pairs representations correlation distances pairs representations gabor filter representation performed relation future work plan investigate representation representation derived directly face images cottrell providing account human data future research include empirical testing predictions evaluating applicability model domains face processing evaluating ability modeling paradigms account data acknowledgements chris comments previous draft members research unit earlier comments work research supported part grant references bishop neural networks pattern recognition oxford university press oxford faces multidimensional face space logical science press submitted accounts face recognition journal experimental psychology learning memory cognition cottrell eigenfaces familiarity proceedings annual conference cognitive science society pages erlbaum retrieval model recognition recall psychological review buhmann malsburg size distortion invariant object recognition hierarchical graph matching proceedings ijcnn international joint conference neural networks volume pages attention similarity rela journal experimental psychology general errors combination stored stimulus features produce memory memory cognition prototype formation faces case pseudo memory british journal psychology tanaka giles simon submitted mapping attractor fields face space bias face recognition turk pentland eigenfaces recognition journal cognitive neuroscience exemplar model face processing effects journal experimental psychology
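The facial memory paper above models familiarity as the height of a density surface built from Gaussian kernels centred on the studied exemplars. Kernel height and width are allowed to vary with each exemplar's distinctiveness, measured as the average distance to its nearest studied faces. The sketch below illustrates that idea under stated assumptions: the linear rule that maps distinctiveness to kernel width, and all function and parameter names, are hypothetical, since the fitted functional form is not recoverable from the text.

```python
import math
from typing import List, Sequence

Point = Sequence[float]


def euclidean(a: Point, b: Point) -> float:
    """Plain Euclidean distance between two points in face space."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))


def distinctiveness(index: int, exemplars: Sequence[Point], k: int) -> float:
    """Average distance from one exemplar to its k nearest studied neighbours."""
    dists = sorted(
        euclidean(exemplars[index], other)
        for j, other in enumerate(exemplars)
        if j != index
    )
    nearest = dists[: max(1, min(k, len(dists)))]
    return sum(nearest) / len(nearest)


def kernel_widths(exemplars: Sequence[Point], k: int,
                  base: float, gain: float) -> List[float]:
    """Assumed linear rule: more distinctive exemplars get wider kernels."""
    return [base + gain * distinctiveness(i, exemplars, k)
            for i in range(len(exemplars))]


def familiarity(probe: Point, exemplars: Sequence[Point],
                widths: Sequence[float], heights: Sequence[float]) -> float:
    """Summed height of the Gaussian kernels at the probe location."""
    return sum(
        h * math.exp(-euclidean(probe, e) ** 2 / (2.0 * w ** 2))
        for e, w, h in zip(exemplars, widths, heights)
    )


if __name__ == "__main__":
    studied = [(0.0, 0.0), (1.0, 0.0), (10.0, 0.0)]  # last point is an outlier
    widths = kernel_widths(studied, k=1, base=0.5, gain=0.1)
    heights = [1.0] * len(studied)
    morph = (0.5, 0.0)  # midpoint of the two similar studied faces
    print(familiarity(morph, studied, widths, heights))
    print(familiarity(studied[0], studied, widths, heights))
```

With a uniform width and height this reduces to a summed-similarity exemplar model. Letting the widths grow with distinctiveness is the modification the paper credits with capturing the familiarity inversion for similar morphs.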
7 computational model cortex function todd dept psychology carnegie mellon univ pittsburgh jonathan cohen dept psychology carnegie mellon univ pittsburgh dept univ pittsburgh pittsburgh abstract accumulating data neurophysiology suggested information processing roles shortterm active memory inhibition present behavioral task computational model developed parallel task developed probe functions simultaneously produces rich behavioral data constraints model model implemented continuoustime providing natural framework study temporal dynamics processing task show model examine consequences neuromodulation specifically model make testable predictions behavioral performance hypothesized suffer reduced tone brain area introduction cortex area human brain significantly relative animals general consensus involved higher cognitive activities planning problem solving language recently specific information cessing mechanisms shortterm active memory inhibition active memory capacity nervous system maintain information form sustained activation states cell firing short periods time forms memory longer duration instantiated todd jonathan cohen david servanschreiber modified values physiological parameters synaptic strength decades large number neurophysiological studies focusing cellular basis active memory cortex studies neurons fire selectively specific stimuli response patterns remain active delay investigators argued data maintains temporary information needed guide behavioral responses sustained terns neural activity hypothesis consistent behavioral findings animal human lesion studies suggest required tasks involving delayed responses prior stimuli addition role active memory investigators focused inhibitory functions argued representations quired overcome previously reinforced response order mediate weaker response cohen servanschreiber observed lesions syndrome behavioral patients inappropriate ways cited evidence plays important role behaviors inappropriate active memory 
inhibition generally computational models play important role providing explain information processing functions arise computational models literature focused active memory zipser inhibitory functions functions servanschreiber models explaining role variety behavioral tasks wisconsin card sort earlier models limited inability fully ture dynamical processes underlying active memory inhibition specifically simulations tightly constrained temporal parameters found behavioral tasks durations stimuli delay periods response latencies limitation found solely models feature behavioral tasks tasks simulated structured ways facilitate dynamical analysis processing paper address limitations previous work describing behavioral task computational model developed parallel provide framework exploring temporal dynamics active memory inhibition consequences behavior describe framework examine effects believed play critical role normal functioning disorders schizophrenia behavioral assessment human function developed task paradigm incorporates components central function cortex shortterm active memory inhibition study dynamics processing task variant continuous performance test commonly study attention computational model cortex function behavioral clinical research standard version task letters presented time middle computer screen subjects instructed press target letter probe stimulus preceded stimulus previous versions subjects responded target trials present version task response procedure employed trials subjects asked press nontarget procedure response latencies evaluated trial providing information temporal dimensions processing task additional modifications made standard paradigm order maximally activity memory function tapped delay stimuli prior stimulus context decide respond probe short delay msec demand memory prior stimulus supported evidence lesions shown effect performance short delay longer delay msec maintain representation prior stimulus order context responding 
current ability contextual representations delay period determined comparing performance short delay trials long delays inhibitory function introducing response tendency overcome respond correctly tendency introduced task increasing frequency target trials remaining trials types distractors target probe letter target probe letter nontarget probe letter target trials occur time type distractor trial occurs time frequency targets development strong tendency respond target probe letter occurs identity response correct times ability inhibit response tendency examined comparing accu trials target occurs absence trials made target occurs trials provide measure response bias random responding trials target probe appears trials interesting respect function trials measure cumulative influence active representations context guiding responses functioning system context representations stabilize increase strength time progresses expected accuracy tend decrease long delay trials relative short mentioned primary benefit paradigm framework simultaneously probe inhibitory memory functions supported preliminary data laboratory suggests fact activated performance task simple structure task generates rich behavioral data stimulus conditions crossed delay conditions accuracy reaction time performance todd jonathan cohen david servanschreiber accuracy short delay accuracy long delay model short delay long delay trial condition trial condition model correct model incorrect data correct data incorrect figure subject behavioral data model performance superimposed panels accuracy delays conditions bottom panels reaction times correct incorrect responses conditions bars represent standard error measurement empirical data measured figure shows data gathered subjects performing task found accuracy unchanged long delays compared short demonstrating active memory adequately support performance accuracy slightly decrease long delays reflecting normal context representations time accuracy trials 
high supporting assumption subjects effectively context representations inhibit responses distinct pattern emerged latencies correct incorrect responses providing formation temporal dynamics processing responses trials slow correct trials fast incorrect pattern reversed data specific detailed information normal functioning constraints development evaluation computational model computational model developed recurrent network model produces detailed information temporal processing task network composed modules input module memory module output module memory module implements memory inhibitory functions believed carried figure shows diagram model unit input module represents stimulus condition computational model cortex function output layer input layer figure diagram model units input module make excitatory connections response module directly indirectly memory module lateral inhibition layer produces competition representations activity stimulus flows memory module responsible maintaining trace relevant context trial units memory module connections activity generated sustained absence input recurrent connectivity utilized unit module assumed simpler formally equivalent analogue fully connected recurrent cell assembly zipser type connectivity produce temporal activity patterns highly similar firing patterns neurons areas cortex activity input memory modules integrated output module output module determines target nontarget response made simulate task network architecture size simple order maximize models attempted simulate neural information processing neuron manner populations units capturing information processing characteristics larger populations real neurons capture stochastic distributed dynamical properties real neural networks small analytically tractable simulations simulation temporally continuous framework processing governed difference equation state unit total input timestep integration gain bias continuous framework preferable discrete plausible scale 
events appropriately exact temporal specifications task duration stimuli delay probe addition continuous character simulation naturally framework inferring reaction times conditions todd jonathan cohen david servanschreiber simulations behavioral performance continuous recurrent generalization backpropagation pearlmutter train network perform connection weights developed training procedure constraint layer weights forced positive layer weights forced negative training consisted repeated presentation conditions task long short delays presentation frequency condition matching behavioral task weights updated presentation trial biases fixed network trained completion training occurred network accuracy reached condition training weights fixed errors reaction time distributions simulated adding zeromean gaussian noise input unit time step trial presentation trial consisted presentation stimulus delay period probe stimulus mentioned duration events appropriately scaled match temporal parameters task msec duration probe presentation msec short delays msec long delays time constant msec simulation network scaling factor provided sufficient temporal resolution capture relationship task delays tractable simulating events responses determined noting output unit reached threshold presentation probe stimulus response latency determined calculating number time steps model reach threshold multiplied time constant facilitate comparisons experimental reaction times constant added values produced parameter correspond time required execute motor response determined squares data trials condition order obtain reliable estimate performance stochastic conditions standard deviation noise distribution threshold response units adjusted produce subject data figure compares results simulation behavioral data figure model good behavioral data pattern accuracy reaction times model matches qualitative pattern errors reaction times produces similar results match model experimental results striking 
considered total data points model fitting free parameters models ability successfully account pattern behavioral performance evidence captures essential principles processing task feel confident examining normal processing extending model explore effects specific disturbances processing behavioral effects neuromodulation previous meeting conference simulation simpler version discussed servanschreiber cohen simulation accuracy short delay accuracy long delay model normal gain model reduced gain data controls figure model performance normal reduced gain graph illustrates effect reducing gain memory layer task performance baseline network network effects tone captured changing gain parameter network units gain thought correspond action modulatory modifying neurons input signals servanschreiber cohen servanschreiber current simulation offers opportunity explore effects neuromodulation information processing functions specific transmitter dopamine modulate activity manipulations dopamine shown effects neuronal activity behavioral performance hypothesized reductions dopamine responsible information processing schizophrenia simulate behavior subjects reduce gain units memory module network reduced gain memory module striking models performance task figure short delay conditions performance model similar control subjects intact model long delays model produces qualitatively pattern performance condition model high error rate error rate pattern opposite control subjects double performance robust effect simulation parameter adjustments model makes predictions highly testable specifically model predicts differences performance between control subjects apparent long delays perform significantly worse control subjects trials long delays perform significantly control subjects trials long delays prediction interesting fact tasks show superior performance relative controls rare experimental research model
makes predictions behavioral performance offers explanations mechanisms analyses trajectories activation states memory module reveals performance failures maintaining representations context stimulus reducing gain memory module distinction signal noise context representations decay time result long delay trials higher probability model show inhibition errors memory errors conclusions results paper show computational analysis temporal dynamics information processing understanding normal behavior developed behavioral task simultaneously inhibitory active memory functions task combination computational model explore effects making specific predictions performance predictions testing references cohen servanschreiber context cortex dopamine connectionist approach behavior biology schizophrenia psychological review simple model cortex function tasks journal cognitive neuroscience cortex york press circuitry primate cortex regulation behavior representational memory handbook nervous system american physiological society modeling effects frontal lobe damage novelty neural networks pearlmutter learning state space trajectories recurrent neural networks neural computation dopamine receptors cortex working memory science servanschreiber cohen effect performance unit system behavior touretzky neural information processing systems mateo morgan kaufmann frontal york press zipser recurrent network model neural mechanism shortterm active memory neural
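The simulation sections above describe a self-recurrent memory unit updated in small time steps with a gain and a bias, a gain reduction that lets context representations decay over long delays, and a reaction time computed from the number of steps needed to reach threshold plus a motor constant. The paper's exact difference equation is not recoverable from this text, so the sketch below assumes a generic Euler-integrated leaky unit with a logistic activation; every name and numeric value (`w_self`, `bias`, `stim`, `dt`, the step counts) is a hypothetical choice for illustration only.

```python
import math

def logistic(net, gain, bias):
    """Logistic activation; gain scales the net input, bias shifts it."""
    return 1.0 / (1.0 + math.exp(-(gain * net + bias)))

def memory_trace(gain, w_self=8.0, bias=-4.0, stim=4.0,
                 stim_steps=100, delay_steps=200, dt=0.1):
    """Activation of one self-exciting unit at the end of the delay."""
    x = 0.0
    for step in range(stim_steps + delay_steps):
        ext = stim if step < stim_steps else 0.0   # stimulus, then delay
        net = w_self * x + ext                     # recurrent plus external input
        x += dt * (-x + logistic(net, gain, bias)) # Euler step of a leaky unit
    return x

def reaction_time(outputs, threshold, ms_per_step, motor_ms):
    """Steps to first threshold crossing, scaled to ms, plus a motor constant."""
    for step, y in enumerate(outputs, start=1):
        if y >= threshold:
            return step * ms_per_step + motor_ms
    return None  # no response within the trial

normal = memory_trace(gain=1.0)    # trace is sustained across the delay
reduced = memory_trace(gain=0.6)   # trace decays toward baseline
```

Under these assumed parameters the unit is bistable at normal gain, so the stimulus trace persists without input, while at reduced gain only the low-activity state is stable and the trace fades during the delay.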
2 neural network detect proteins yoshua bengio school computer science university montreal canada bengio montreal department biology university montreal institute montreal abstract order detect presence location domains amino acid sequences built system based neural network hidden layer trained back propagation program designed efficiently identify proteins exhibiting domains characterized localized conserved regions national biomedical research foundation protein sequence database scanned evaluate programs performance obtained rates false coupled moderate rate false positives introduction amino acid sequences proteins aligned amino acids identical similar chemical properties subsequences domains exhibit similar dimensional structure sequence similarity results common domains sets bound bengio bengio bonds characteristic structure domains found proteins involved cell receptor functions proteins collectively form review williams members possess domains domains characterized conserved groups amino acids localized specific regions poorly conserved domains members current search programs incorporating algorithms algorithm algorithm tion smith detecting domains implicitly amino acid equally portant case domains domain amino acids conserved vari solution problem search algorithms based occurrence position wang profile analysis programs published university wisconsin computer group rely algorithm profile analysis applied search domains holland output programs suffers high rate false positives variations domain length handled traditional method ties proportional number gaps introduced length sition approach entails significant amount spurious recognition considerable variation domain length accounted chosen address problems training neural network recognize accepted domains perceptrons types neural works previously biological research degrees success sejnowski results suited detecting sequence patterns characterize domains design training procedure simple search constitute 
valid solution problems searching proteins domain algorithm network design training network data existence localization highly conserved groups amino acids characteristic domain design similar respects neural networks study speech recognition bengio conserved designated domain identified roughly correspond domain williams amino acids groups necessarily conserved show distribution distribution generally observed proteins important stage system learns joint distributions program scans proteins window stage system consists feedforward neural network inputs hidden outputs figure trained back propagation rumelhart results obtained recognition conserved regions architecture hidden layer similar perceptron stage evaluates based stream outputs generated stage region similar domain detected stage simple dynamic programming algorithm constraints order distance explicitly programmed force recognizer detect sequence high values threshold conserved regions correct order values obtained recognized regions greater threshold weak penalties applied violations distance constraints conserved regions distance based simple rules derived analysis domains rules impact strong detected program easily handles large variation domain size exhibited domains explicit constraints number training examples assumption distance groups critical discriminating factor assumed subsequences probably significantly influence discrimination output units representing features domain hidden units window scanning consecutive amino acids figure structure neural network input sequence epsilon chain region human starting ending score figure sample output search domains present constant region epsilon chain file number listed position text score domain list training group proteins comprising domains williams order increase size training additional sequences stochastically generated substituting critical positions
domain designed affect local distribution minimize chemical character region program evaluated optimized scanning protein protein version results presented based searches database noted generated cutoff complete sequences insects including scanned corresponds sequences present database trial runs program cutoff hold eliminates vast majority false positives effect rate false sample output listed results protein sequence database searched proteins identified possess domain scan proteins comprising database required average hours time comparable computationally intensive programs profile analysis computer similar searches required hours time sufficiently fast user alter cutoff threshold repeatedly searching proteins neural network detect proteins table output search protein sequence database domains sorted score class beta chain precursor human kappa chain region mouse human kappa chain region mouse growth factor receptor precursor mouse class alpha chain mouse class alpha cham mouse receptor alpha chain precursor human tcell surface precursor mouse transforming alpha chain region human alpha chain region human alpha chain region human factor precursor mouse class alpha chain precursor human chain chain class beta chain precursor class beta chain precursor human class beta chain precursor human class beta chain precursor human kinase mouse class chain precursor mouse tcell chain precursor region mouse tcell chain precursor region mouse tcell receptor chain precursor region mouse tcell receptor chain precursor region mouse tcell receptor chain precursor region mouse tcell receptor beta chain region mouse chain region mouse tcell beta chain region human class alpha chain precursor mouse class alpha chain mouse long form precursor short form precursor precursor brain precursor class alpha chain precursor class alpha chain precursor class alpha cham precursor mouse chain precursor region mouse tcell alpha chain precursor mouse tcell receptor delta chain region mouse tcell 
beta chain precursor region mouse tcell receptor chain precursor region human tcell receptor alpha chain precursor region human tcell surface precursor human tcell surface precursor human chain human hypothetical protein sodium channel protein heavy chain region frog protein mouse tcell receptor alpha chan human precursor hurt beta precursor human receptor alpha cham tcell class beta cham precursor mouse class chain precursor human class alpha chain precursor human chain cham region human alpha chain region precursor class precursor beta precursor human human receptor beta chain precursor human receptor precursor mouse hybrid receptor precursor region human heavy chain precursor region human heavy precursor region human cell precursor mouse region epsilon cham region receptor alpha chain region mouse human tcell receptor chain region mouse tcell receptor gamma chain region mouse chain precursor mouse epsilon chain region human chain region human heavy chain region mouse heavy chain region mouse heavy cham region mouse kappa chain region mouse heavy cham region mouse chain region mouse heavy cham region mouse heavy chain region mouse heavy chain region mouse beta precursor human precursor human receptor beta cham mouse receptor beta chain precursor region mouse sodium channel protein kappa cham mouse cham region mouse cham mouse kappa chain region mouse cham region mouse kappa cham mouse kappa chain regions mouse tcell receptor alpha chain precursor region mouse surface epsilon chain human tcell precursor mouse tcell surface precursor mouse tcell receptor alpha chain precursor region human tcell receptor chain region mouse tcell receptor gamma chain region human tcell receptor cham region human receptor cham region human heavy chain region mouse heavy chain region mouse heavy chain region mouse heavy chain precursor region mouse tcell receptor beta chain region human tcell receptor beta chain region mouse tcell receptor chain region human tcell receptor chain region 
human gamma chain region human kappa chain region mouse kappa chain region mouse kappa chain region mouse kappa chain region mouse kappa chain region human precursor human growth factor receptor precursor mouse protein tcell beta chain precursor region human tcell precursor kappa chain precursor region mouse precursor human heavy chain region mouse chain precursor human class cham precursor mouse class chain precursor mouse class beta chain human class beta chain precursor class beta chain precursor human class beta chain precursor human class beta chain human class beta chain precursor human chain region mouse chain region human heavy region mouse human heavy chain region mouse heavy chain regions mouse chain precursor region heavy chain region mouse heavy chain region mouse heavy chain region mouse receptor beta cham precursor region human heavy chain region mouse heavy chain region mouse chain region heavy chain region mouse tentative sequence human precursor human chain region human epsilon chain region human neural cell protein precursor mouse kappa chain region mouse kappa chain precursor region kappa chain region mouse kappa chain region mouse receptor beta chain region human heavy chain region mouse heavy chain region mouse tcell receptor alpha chain region mouse tcell receptor alpha chain mouse tcell receptor alpha chain precursor region mouse tcell receptor alpha chain precursor region mouse receptor alpha chain precursor mouse class alpha chain precursor human class alpha chain precursor human tcell receptor alpha chain precursor region human kappa chain precursor chain mouse neural cell protein precursor mouse heavy chain region mouse tentative kappa chain precursor region human kappa chain region mouse long form precursor short form precursor precursor brain large precursor kappa chain region human kappa chain human chain kappa chain precursor region human heavy chain precursor mouse chain region mouse heavy chain region mouse kappa chain region 
precursor human kappa chain precursor human heavy chain region mouse cell precursor mouse chain region mouse epsilon chain region human chain region human kappa chain precursor region mouse kappa chain precursor region bengio bengio table efficiency detection present scores recognized domains protein type listed recognition efficiency dividing number proteins correctly identified bearing domain total number proteins identified file description domain multiplied numbers parentheses number complete protein sequences type species complete sequences light heavy chains human mouse origin scanned threshold protein mouse forms score detected domains recognition efficiency proteins human forms class forms class forms tcell receptor chains mouse forms tcell receptor chains human forms vast majority proteins scored human mouse rabbit origin insect proteins scored threshold proteins training present protein databases detected proteins detected database listed table sorted score human class included training mouse class detected detected proteins human proteins include domain arranged domains detected sufficiently served domains lacking feature degenerate feature scored lower recog threshold recognition human mouse sequences measure recognition efficiency rate false species table table lists proteins categorized false positives detected searching threshold relative total number domains detected corresponds false positive rate strict sense proteins false positives exhibit features domain correct order neural network detect proteins distances observed domains proteins rich sodium channel chain false positives surprising domain composed solution problem lies larger training addition intelligent stage designed evaluate distances increase city detection table false positives obtained searching threshold proteins categorized false positives listed text details transforming protein chain chain mouse hypothetical protein strain sodium channel protein protein mouse disease protein 
human precursor human protein protein precursor human discussion detection specific protein domains increasingly important proteins succession domains domains weakly designed neural network detect proteins comprise domains evaluate approach solve problem alter neural search programs exist search programs designed recognize flanking regions domain features conserved features domains wang exhibit poor generate statistically insignificant scores analyzed align program williams search programs profile analysis handle large variations domain size exhibited domain comprised search results high rates false positives size protein databases increases considerably year problem false positives rates substantially decreased view problems found application neural network detection domains advantageous solution state biological knowledge advances domains added training training learn statistical features conserved permit detection domain examples domain similar distribution previously possibly degenerate sequences therefore detected acknowledgments research supported grant canadian natural sciences engineering research council allowing access experimental references bengio mori programmable execution multilayered networks automatic speech recognition communications association computing machinery bengio mori speaker independent speech recognition neural networks speech knowledge touretzky advances neural networks information processing systems holland identification conserved region common strain biology press comprehensive sequence analysis programs acids profile analysis detection related proteins proc natl acad general method applicable search similarities amino acid sequence proteins biol sejnowski predicting secondary structure proteins neural network models biol rumelhart hinton williams learning internal representation error propagation parallel distributed processing press cambridge smith identification common subsequences biol schneider gold ehrenfeucht perceptron
algorithm distinguish translational sites acids wang tang expands nature rapid similarity searches acids protein data banks proc natl acad williams domains cell surface recognition
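The second stage described in the network design section uses a simple dynamic programming pass that requires the conserved regions to appear in the correct order with scores above a threshold, and applies only weak penalties when distance constraints are violated. A minimal sketch of such a pass is given below. It is an assumption-laden illustration: the function name `best_ordered_match`, the single shared gap range, and the fixed penalty value are hypothetical, and the paper's actual rules for distances are not reproduced here.

```python
def best_ordered_match(scores, threshold, min_gap, max_gap, penalty):
    """scores[k][i] is the first-stage output for conserved feature k at
    window position i. Returns the best total score of one hit per feature,
    in order, or -inf if the features never occur in the required order."""
    neg_inf = float("-inf")
    n_features = len(scores)
    n_pos = len(scores[0])
    best = [[neg_inf] * n_pos for _ in range(n_features)]

    # First feature: any position scoring above threshold may start a match.
    for i in range(n_pos):
        if scores[0][i] > threshold:
            best[0][i] = scores[0][i]

    # Later features must follow an earlier hit; odd spacing costs a penalty.
    for k in range(1, n_features):
        for i in range(n_pos):
            if scores[k][i] <= threshold:
                continue
            for j in range(i):
                if best[k - 1][j] == neg_inf:
                    continue
                gap = i - j
                cost = 0.0 if min_gap <= gap <= max_gap else penalty
                candidate = best[k - 1][j] + scores[k][i] - cost
                if candidate > best[k][i]:
                    best[k][i] = candidate

    return max(best[-1])
```

Because spacing violations only subtract a small penalty rather than rejecting a match outright, a sketch like this tolerates large variation in domain length, which is the property the text emphasizes.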
3 bumptrees efficient function constraint classification stephen omohundro international computer science institute center suite berkeley california abstract class data structures called bumptrees structures efficiently implementing number neural network related operations empirical comparison radial basis functions presented robot mapping learning task applications density estimation classification constraint representation learning outlined bumptree bumptree geometric data structure efficiently learning evaluating geometric relationships variety contexts natural generalization hierarchical geometric data structures including trees geometric learning tasks including approximating functions constraint surfaces classification regions probability densities samples function approximation case approach related radial basis function neural networks supports faster construction faster access flexible modification provide empirical data comparing bumptrees radial basis functions section bumptree provide efficient access collection functions euclidean space interest complete binary tree leaf corresponds function interest functions internal node defining constraint interior nodes function larger functions leaves cases leaf functions peaked regions origin simple kind bump function symmetric center vanishes ball figure shows structure twodimensional bumptree setting ball supported bump leaf functions tree structure tree functions figure twodimensional bumptree important special case bumptrees access collections gaussian functions multidimensional spaces collections represent smooth probability distribution functions gaussian mixture arises adaptive kernel estimation schemes convenient represent quadratic exponents gaussians tree gaussians simplest approach quadratic functions internal nodes leaves shown figure classes internal node functions provide faster access figure bumptree holding gaussians hierarchical geometric data structures special cases bumptrees choosing
internal node functions shown figure regions represented functions inside region vanish function shown figure aligned coordinate axis constant side decreases quadratically side coordinate location constant situations coefficient quadratic decrease function evaluated extremely efficiently data point fast pruning operations evaluations effectively implement fast nearest neighbor computation bumptree structure generalizes kind query differ scales points directions empirical results presented section based bumptrees kind internal node function figure internal bump functions omohundro omohundro higher performance approaches choosing tree structure build leaf data algorithms studied construction omohundro applied general task bumptree construction fastest approach analogous basic tree construction technique friedman recursively splits functions sets size simulations section effective approach builds tree bottom deciding pair functions single node intermediate speed quality incremental approaches dynamically insert delete leaf functions bumptrees efficiently support important queries simplest kind query presents point space leaf functions point larger bumptree search root prune subtrees root function smaller point interesting queries based branch bound generalize nearest neighbor queries trees support typical case collection gaussians request gaussians point factor gaussian largest point search proceeds promising branches continually maintains largest found point subtrees factor current largest function robot mapping learning task kinematic space visual space figure robot mapping task omohundro figure shows setup defines mapping learning task study data structure setup investigated extensively involves camera robot kinematic state angle control coordinates visual state visual coordinates spots mapping kinematic space nonlinear dimensions system attempts learn mapping observing state variety randomly chosen kinematic states
random inputoutput pairs system generalize ping inputs mapping task chosen fairly representative typical problems arising vision robotics radial basis function approach mapping learning represent function linear combination functions symmetric chosen centers simplest form basis functions centered input points recent variations fewer basis functions sample points choose centers clustering timing results terms number basis functions number sample points variation type forms basis functions suggested study gaussian linearly increasing functions gave similar results coefficients basis functions chosen forms squares data fits require time proportional cube number parameters general periments reported singular decomposition compute coefficients approach mapping learning based bumptrees builds local models mapping region space data training samples nearest region local models combined convex ence functions model influence function peaked region salient bumptree structure local models models great influence query sample eval influence functions vanish compact region tree prune branches influence models influence distance branch bound technique determine contributions greater error bound bump functions point region interest called partition unity form influence bumps dividing smooth bumps gaussians smooth bumps vanish sphere form easily computed unity local models affine functions determined squares local samples combined partition unity point convex combination local model values error full model bounded errors local models full approxi smooth local bump functions results give precise bounds average number samples needed achieve approximation error functions bounded derivative approach linear fits small local samples avoiding computationally expensive fits data required radial basis functions locality easily update model online data arrives bumptrees efficient function constraint classification learning bump functions gaussians forms partition unity local affine models 
final smoothly interpolated approx function influence bumps centered sample points width determined sample density affine model influence bump determined weighted squares sample points nearest bump center weight decreases distance performs global number samples points radial basis func tion approach achieves smaller error approach based bumptrees terms construction time achieve error bumptrees clear shows square error robot mapping task decreases function time construct mapping square error learning time secs figure square error function learning time important applications learning time retrieval time retrieval radial basis functions requires basis function computed query input results combined weight matrix time increases linearly function number basis functions represen tation bumptree approach influence bumps affine models pruned bumptree retrieval perform computation input shows retrieval time function number training samples robot ping task retrieval time radial basis functions crosses samples increases linearly graph algorithm retrieval time empirically grows slowly doesnt require time samples represented shown representation improved size generalization capacity merging technique idea merging local models influence bumps single model pair increases error omohundro merged process repeated pair left exceed error criterion algorithm good discovering representing linear parts single model higher resolution models areas strong nonlinearities retrieval time secs gaussian figure retrieval time function number training samples extensions tasks bumptree structure implementing efficient versions variety geometric learning tasks omohundro fundamental task density estimation attempts model probability distribution space samples drawn distribution powerful technique kernel estimated distribution represented gaussian mixture symmetric gaussian centered data point widths chosen local density samples bestfirst merging technique produce mixtures consisting fewer 
nonsymmetric bumptree find gaussians internal node functions include faster evaluate functions shown figure efficiently perform operations probability densities represented basic query return density location bumptree branch bound achieve retrieval logarithmic expected time quickly find marginal probabilities integrating dimensions tree quickly identify gaussian contribute conditional distribu tions represented form bumptrees compose distributions discussed mapping learning evaluation situations natural input output variables required mapping probability distribution peaked lower dimensional surface thought constraint networks bumptrees efficient function constraint classification learning constraints imposed order variables natural describing problems bumptrees open possibilities efficiently representing propagating smooth constraints continuous variables basic query external constraints variables network impose constraints multidimensional product gaussians represent joint ranges variables operation imposing constraint surface thought multiplying external constraint gaussian function representing constraint distribution product gaussians gaussian operation produces gaussian mixtures bumptrees facilitate operation mappings constructs surfaces local affine patches weighted influence functions developed local analog principle components analysis builds surfaces random drawn mapping structures bestfirst merging operation discover affine structure constraint surface finally bumptrees enhance performance classifiers approach directly implement bayes classifiers adaptive kernel density estimator scribed distribution function bumptree class sophisticated branch bound single tree classes summary bumptrees natural generalization hierarchical geometric cess structures enhance performance neural network algorithms compared radial basis functions mapping learning technique bumptrees boost retrieval performance radial basis func tions directly basis functions decay centers 
neural network approaches network perform work query susceptible dramatic kind access structure references nonparametric density estimation view york wiley friedman finkel algorithm finding match logarithmic expected time trans math software connectionist robot motion planning approach reaching diego academic press omohundro efficient algorithms neural network behavior complex systems omohundro algorithms international computer science institute technical report omohundro geometric learning algorithms physica nearestneighbor searching trees associates technical report
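The mapping-learning scheme described above (Gaussian influence bumps normalized into a partition of unity, then used to blend local affine models) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the shared width parameter, and the brute-force evaluation of every bump are assumptions made for clarity. The bumptree itself would prune bumps with negligible influence instead of visiting all of them.

```python
import math

def gaussian_bump(x, center, width):
    """Unnormalized Gaussian influence bump evaluated at x (assumed form)."""
    d2 = sum((xi - ci) ** 2 for xi, ci in zip(x, center))
    return math.exp(-d2 / (2.0 * width ** 2))

def partition_of_unity(x, centers, widths):
    """Divide each bump by the sum of all bumps so the weights sum to one.

    Assumes x is close enough to some center that the sum does not underflow.
    """
    raw = [gaussian_bump(x, c, w) for c, w in zip(centers, widths)]
    total = sum(raw)
    return [r / total for r in raw]

def affine(model, x):
    """Evaluate one local affine model, given as (matrix A, offset b), at x."""
    A, b = model
    return [sum(a * xi for a, xi in zip(row, x)) + bi for row, bi in zip(A, b)]

def blended_prediction(x, centers, widths, models):
    """Convex combination of the local affine model values at x."""
    weights = partition_of_unity(x, centers, widths)
    outputs = [affine(m, x) for m in models]
    dim = len(outputs[0])
    return [sum(w * out[k] for w, out in zip(weights, outputs)) for k in range(dim)]
```

Because the weights are non-negative and sum to one, the blended value at any point is a convex combination of the local model values, which is why the error of the full model is bounded by the errors of the local models.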
4 neural network analysis event related potentials predicts vigilance william terrence sejnowski computational neurobiology laboratory salk institute jolla abstract automated monitoring vigilance attention intensive tasks control sonar operation highly desirable opera monitor operator step goal feedforward neural networks trained backpropagation interpret event related potentials ated periods high vigilance accuracy system data averaged minutes accuracy obtained linear discriminant analysis practical vigilance monitoring require prediction shorter time periods average minutes correct prediction vigilance measure additionally achieved similarly good performance segments power spectrum short introduction tasks society demand sustained attention varying stimuli long period time detection failure vigilance tasks enormous physiological variables sejnowski heart rate pulse correlate extent level attention appearance spectrum sleep agreement bands predict vigilance recent studies strong correlation power spectra frequencies attentional level subjects performing sustained task measure widely assessed context involves eventrelated potentials voltage ongoing time locked sensory motor cognitive events small recognized background electrical activity erps signal typically extracted background noise consequence averaging trials waveform remains constant repetition event background activity random amplitude late cognitive eventrelated potentials related attentional allocation erps evoked subject stimulus condition present monitoring situation monitoring precisely time stimulus occurrence unknown shorter latency responses evoked signals evaluated data sonar simulation task obtained presented auditory targets slightly background noise male united states tones subjects instructed ignore appeared randomly seconds task irrelevant back ground collected analyzed erps evoked task irrelevant classified groups depending appeared correctly identified target erps missed target erps erps 
showed relative increase compo nents decrease sign time peak components prior linear discriminant analysis performed averages session showed correct classification erps obtained single scalp site erps permit classification averaging large sample addition power spectra frequency bands computed clas made basis continuous measure performance error rate calculated hits moving window power spectrum revealed significant coherence observed frequencies performance method data groups input data erps msec sample task irrelevant probe reduced points lowpass filtering normalized data basis maximum minimum values entire maintaining amplitude variability single classified based subjects performance target tone power spectrum obtained seconds input analysis event related potentials predicts vigilance predict continuous estimate vigilance error rate obtained averaging subjects performance window normalized frequencies previously shown strongly related error rate frequency individually normalized range network feedforward networks trained backpropagation compared twolayer network threelayer networks varying number hidden units simulations architecture trained times task weights time random seed initial simula tions performed select network parameter values learning rate divided fanin weight initialization range data jackknife procedure simulation single pattern excluded training considered test pattern pattern turn removed test pattern training data limited simulations performed half data training remaining part testing subjects runs training testing data separate sessions results erps simulation twolayer network assess neural network approach relative previous results data consisted averages single scalp site subjects double session giving total patterns jackknife procedure ways considered individually study erps single subject grouped removed form test network trained epochs testing figure shows weights networks trained erps obtained removing single waveform weight values 
corresponds features common erps negative features common erps classification patterns network considerably accurate correct obtained previous analysis correct evaluation networks started random weight remaining networks produced correct responses patterns missed cases hidden units improve generalization subject jackknife results similar correct networks remaining increased difficulty generalizing individuals ability network generalize shorter period time tested progressively decreasing number trials testing network trained average erps sejnowski figure weights twolayer networks trained initial weights correspond sample point time input data correct classification hidden units correct classification hidden units figure generalization performance pattern left subject jack twolayer threelayer networks number hidden units represents random start network analysis event related potentials predicts vigilance correct classification figure individual erps total number generalization testing made varying number formed individual erps figure performance single chance erps minutes accuracy obtained report results twolayer network compare previous analysis power spectrum frequency bands single scalp site input data error rate averaged seconds intervals runs error rate power spectra filtered minute time window good results obtained cases subject made errors time subject made errors training difficult generalization poor results virtually identical lack improvement fact performance close data threelayer networks improve generalization performance running average includes information time network making prediction causal prediction attempted multiple power spectra intervals past predict error rate results subject shown figure predicted error rate differs target root square error sejnowski time figure generalization results predicting error rate dotted line network output solid line desired time figure causal prediction error rate dotted line network output solid line desired 
analysis event related potentials predicts vigilance figure weights twolayer causal prediction network frequency band represents influence output unit power band previous times ranging left figure shows weights twolayer network trained predict error rate network information frequency bands predicting error rate values weights strong peak recent time steps indicating power frequency band predicts state vigilance short time scale alternating positive negative weights present suggest rapid power band predictive vigilance derivative power signal discussion results neural networks analyzing physiological measures results suggest analysis applied detect fluctuations attentional level subjects real time analysis tool understanding occur electric activity brain states attention analysis lack improvement introduction hidden units small size data data small adding hidden units connections reduce ability find general solution problem results point gener alization suggests possibility network multiple subjects training network individual results suggest detection sejnowski time interval erps completion analysis order obtain line detector attentional future research combination measures heart rate idea model choose network architectures rameters depending specific subtask scott makeig mark cognitive performance department naval health research center diego data invaluable discussions bottou provided simulator supported ministry instruction italy award national aging investigator howard hughes medical institute research supported grant references wright electrical activity brain vigilance clinical extreme spectral parameters neuroscience peter neurophysiological vigilance indicators operational analysis train vigilance monitoring device laboratory field study vigilance theory operational performance physiological correlates york plenum press makeig alertness coherence fluctuations performance spectrum cognitive performance department diego technical report human vigilance 
auditory evoked responses clinical neurophysiology habituation auditory stimuli task difficulty probability interval deter auditory stimuli clinical neurophysiology probability interval makeig predicting vigilance brain evoked responses irrelevant auditory probe cognitive performance department diego technical report
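The jackknife procedure used in the method section above (each pattern in turn is withheld from training and used as the single test pattern) can be sketched as follows. The nearest-mean classifier is only a hypothetical stand-in so that the example runs on its own. The study itself trained feedforward networks with backpropagation, and all function names here are assumptions.

```python
def jackknife_accuracy(patterns, labels, train, predict):
    """Leave-one-out estimate: train on all patterns but one, test on that one."""
    correct = 0
    for i in range(len(patterns)):
        train_x = patterns[:i] + patterns[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        model = train(train_x, train_y)
        if predict(model, patterns[i]) == labels[i]:
            correct += 1
    return correct / len(patterns)

def train_nearest_mean(xs, ys):
    """Stand-in learner (not the paper's network): store the mean of each class."""
    sums, counts = {}, {}
    for x, y in zip(xs, ys):
        acc = sums.setdefault(y, [0.0] * len(x))
        for k, v in enumerate(x):
            acc[k] += v
        counts[y] = counts.get(y, 0) + 1
    return {y: [v / counts[y] for v in acc] for y, acc in sums.items()}

def predict_nearest_mean(means, x):
    """Assign x to the class whose stored mean is closest in squared distance."""
    return min(means, key=lambda y: sum((a - b) ** 2 for a, b in zip(x, means[y])))
```

Grouping the withheld patterns by subject instead of removing them one at a time gives the subject-wise variant of the same procedure.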
11 dynamically adapting kernels support vector machines dept engineering mathematics university campbell dept engineering mathematics university john shawetaylor dept computer science royal college abstract tunable parameters port vector machines controlling complexity resulting hypothesis choice amounts model selection found means validation present algo rithm automatically perform model selection additional computational cost validation procedure model selection learning separate kernels dynamically adjusted learning process find kernel parameter upper bound generalisation error theoretical results approach experimental results validity presented introduction support vector machines svms learning systems designed automatically tradeoff accuracy complexity minimizing upper bound general error provided theory practice svms tunable parameters determined order achieve ance values found means validation important implicitly defines structure high dimensional feature space maximal margin hyper plane found rich feature space system overfit data dynamically adapting kernels support vector machines conversely system unable separate data kernels poor capacity control performed tuning kernel parameter subject margin maximized noisy datasets quantity parameter svms display remarkable dimensionality reduction model selection systems neural networks architectures tested decision trees faced similar problem pruning phase hand svms shift model complexity simply tuning continuous parameter generally model selection svms performed standard learning svms testing validation order determine optimal expensive terms computing time training data paper propose scheme adjusts explore space models additional computational cost compared learning approach makes information efficient sample complexity sense model selection procedure prove theoretical result margin structural risk minimization bound eralization error depend smoothly kernel parameter exploited algorithm system close maximal 
margin kernel parameter changed smoothly phase theoretical bound theory computed lowest bound section present experimental results showing model selection efficiently performed proposed method gaussian kernels simulations outlined support vector learning decision function implemented machines written obtained maximising lagrangian number patterns respect subject constraints functions called kernels kernels provide sion highdimensional feature space campbell shawetaylor implicitly define nonlinear mapping training data feature space separated maximal margin hyperplane number choices made gaussians kernels upper bound proven theory generalisation error hyperplanes feature space radius smallest ball training number training points margin complete survey generaliza tion properties machines lagrange multipliers found means quadratic program ming optimization routine found validation illustrated figure minimum generalisation error tradeoff overfitting ability find efficient solution figure generalization error yaxis function xaxis mirror problem gaussian kernels training error maximal margin averaged examples automatic model order selection prove theorem shows margin optimal hyperplane smooth function kernel parameter upper bound generalisation error state implicit function theorem implicit function theorem continuously differentiable function solution equation partial derivatives matrix full rank dynamically adapting kernels support vector machines exists function function continuous theorem machines depends smoothly kernel parameter proof function data maps choice optimal parameters lagrange parameter machine kernel matrix functional machine maximizes solution indices assume indexed nonsingular maximal margin hyperplane expressed terms subset indices choose indices nonsingular points indexed margin function neighbourhood elements satisfies equation points function implicit function continuous unique continuously differentiable partial derivatives matrix full rank 
partial derivatives matrix definition nonzero satisfying matrix nonsingular satisfying linear constraint implicit function theorem continuous function proven shows continuous function radius ball points continuous function generalization error bound form constant corollary corollary bound generalization error smooth means margin optimal small variations kernel rameter produce small variations margin bound generalisation error updating system campbell shawetaylor suboptimal position suggests strategy gaussian kernels instance kernel selection procedure initialize small maximize margin compute bound observe validation error increase kernel parameter stop predetermined reached repeat step procedure takes advantage fact small convergence generally rapid overfitting data system equilibrium iterations sufficient move back maximal margin situation words system brought maximal margin state beginning computationally cheap actively situation continuously adjusting kernel parameter gradually increased section experimentally investigate procedure datasets numerical simulations algorithm recently developed authors train machines chosen algorithm regarded gradient ascent procedure maximising lagrangian suboptimal state close optimum computational effort needed bring system back maximal margin position algorithm margin positively labelled patterns stop step experimental results section implement algorithm datasets plot upper bound theory generalization error functions order compute bound estimate radius ball feature space general explicitly maximising lagrangian convex quadratic programming routines subject constraints radius found dynamically adapting kernels support vector machines upper bound quantity noting gaussian kernels training points surface sphere radius centered origin feature space easily noting distance point origin norm figure give bounds upper bound general error test standard datasets dependent sonar classification dataset sejnowski wisconsin breast cancer 
dataset plots addi tional computational cost determining quadratic problem gaussian kernels plot bound generalisation error figures united states postal service dataset handwritten digits instances investigated mini bound approximately coincides minimum generalisation error good criterion suitable choice estimate derived solely training data additional validation figure generalisation error solid curves sonar classification left wisconsin breast cancer datasets upper curves dotted show upper bounds theory curves starting small observed margin rapidly margin remains close incremented small amount study performance system traversing range alternately maximising margin previous optimal starting point found procedure significant computational cost general sonar classification dataset mentioned starting increments iterations reach reach iterations learning rough doubling learning time determine reasonable good generalisation validation campbell shawetaylor figure generalisation error solid curve upper bound theory dashed curve digits usps dataset handwritten digits conclusion presented algorithm automatically learns kernel parameter additional cost computational sense model selection takes place learning process experimental results provided showing strategy good estimate correct model complexity references theoretical foundations potential function method pattern recognition learning remote control bartlett generalization performance support vector machines pattern classifiers advances kernel methods support vector learning bernhard christopher burges alexander smola press cambridge burges tutorial support vector machines pattern recognition data mining knowledge discovery campbell algorithm fast simple learning procedure support vector machines shavlik chine learning proceedings international conference morgan mann publishers francisco sejnowski neural networks lecun jackel bottou cortes denker drucker guyon muller simard vapnik comparison learning algorithms 
handwritten digit recognition international conference artificial neural networks bartlett williamson anthony structural risk minimization datadependent hierarchies technical report neural networks medical diagnosis comparison methods proceedings international conference vapnik nature statistical learning theory springer verlag james robert advanced calculus belmont wadsworth
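The observation used in the experimental section above, that with Gaussian kernels every training point lies on the surface of a sphere of radius one centered at the origin of feature space, follows from the kernel value of a point with itself. The sketch below checks this numerically. The parameterization exp(-||x - z||^2 / (2 sigma^2)) and the function names are assumptions, since the exact form used by the authors is not recoverable from the text.

```python
import math

def gaussian_kernel(x, z, sigma):
    """Gaussian kernel under the assumed parameterization exp(-|x-z|^2 / (2 sigma^2))."""
    d2 = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-d2 / (2.0 * sigma ** 2))

def feature_norm(x, sigma):
    """Distance of the mapped point from the origin: sqrt(K(x, x))."""
    return math.sqrt(gaussian_kernel(x, x, sigma))

def feature_distance(x, z, sigma):
    """Distance between mapped points: sqrt(K(x,x) - 2 K(x,z) + K(z,z))."""
    sq = gaussian_kernel(x, x, sigma) - 2.0 * gaussian_kernel(x, z, sigma) \
        + gaussian_kernel(z, z, sigma)
    return math.sqrt(max(0.0, sq))  # clamp tiny negative rounding errors
```

Since the norm equals one for every point and every kernel width, a radius of one is always a valid upper bound on the radius of the smallest enclosing ball, so no extra quadratic program is needed to bound it.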
7 solvable connectionist model recall ordered lists burgess department anatomy university college london london england email abstract model shortterm memory serially ordered lists stimuli proposed implementation articulatory loop thought mediate type memory model predicts presence timevarying context signal coding timing items presentation addition store phonological information process serial items context nodes phonemes hebbian connections showing short long term plasticity items activated phonemic input presentation context phonemic feedback output serial selection items occurs winnertakeall interaction items winner subsequently receiving decaying inhibition approximate analysis error probabilities gaussian noise output presented model account probability error function serial position list length word length phonemic similarity temporal grouping item list familiarity proposed starting point model vocabulary acquisition introduction shortterm memory serially ordered lists stimuli scribed crude level idea articulatory loop information encoded decays seconds serial successfully accounts burgess linear relationship memory span number items lists items correctly recalled articulation rate number items varies function items language development fact span lower lists similar items distinct speech articulatory distractor tasks reduce memory span recent evidence suggests plays role learning words development recovery brain emission studies phonological store left vocal involves area motor areas involved speech planning production detail types errors addressed idea majority errors order errors item errors tend involve neighbouring similar items probability correctly recalling list function list length probability correctly recalling item function serial position list serial position curve shape span increases familiarity items specifically increase increases list previously presented hebb effect position specific occur item previous list recalled position current list 
data impose strong functional constraints neural mechanism implementing models showing serial behaviour rely form chaining associates previous states successive states recurrent nections types chaining item phoneme representations gener errors human data burgess items maintained serial order association timevarying signal suggested position specific referred context recovery suppression involved selection process modification competitive model speech production characteristics serially ordered items arise context phoneme information selection item model model consists layers artificial neurons representing context phonemes items connected hebbian connections long short term plasticity winnertakeall interaction item nodes time step item greatest input activation winner time step receives decaying inhibition prevents selected presentation phoneme nodes activated acoustic translated visual input activation context layer pattern shown item nodes receive input phoneme nodes connections connections connectionist model recall ordered lists context items suppression output phonemes translated visual input acoustic input buffer figure context states function serial position filled circles active nodes empty circles inactive nodes architecture model full lines connections short long term plasticity dashed lines routes information enters model learn association context state winning item learn association active phonemes recall context layer presentation activation spreads item layer item wins activates phonemes item wins context phoneme inputs output suppressed model makes errors errors occur gaussian noise added items activations selection winning item output errors items similar activation levels decay connection weights inhibition presentation items selected wrong order performance decrease time present recall list learning familiarity connection weights long short term plasticity similarly incremental long term component shot short term component decays factor weight 
connection components learning occurs burgess activations decreases long term component saturates connection weights negative items familiarity reflected size long term components weights storing association phonemes components increase presentation recall item lists unfamiliar items item nodes completely shortterm connections phoneme nodes learned presentation presentation familiar item leads selection item node weights output item activate phonemes strongly weights unfamiliar items phone similar familiar item tend represented familiar item node advantage longterm weights presentation list leads increase long term component context item association list presented recall improves position specific previous lists occur notice weights item winning presentation output increased details items list phonemes item seconds present recall time item node activation context node activation context nodes active time phoneme node activation phonemes comprising item context nodes activation phonemes activation sets relative effect context phoneme layers ensures items phonemes burgess longterm components weights familiar items unfamiliar items chosen match data longterm components weights increase presentations list interaction item node input decaying inhibition imposed items selection presentation output gaussian random variable added output excitatory input item phoneme layer presentation context phoneme layers recall presentation recall recall phoneme nodes activated solvable connectionist model recall ordered lists time step refers presentation recall item duration variable increases time step refers time serial position short term connection weights inhibition decay factor time step algorithm corresponds repeating recall phase presentation activations short term weights context layer state input items phoneme layer state select winning item learning increment connection weights decay multiply shortterm connection weights factor inhibit winner item selected recall context 
layer state phoneme activations select winning item output activate phonemes select winning item presence noise learning decay inhibit winner item selected analysis output model averaged trials depends activation values items output step time activations noise level probability item winner estimation simple exact expression depends items output prior time define time output time item selected presentation output absence errors prior errors time inhibition item short term weights item decayed factor list familiar items excitatory input item output time burgess figure serial position curves full lines show estimation extra markers error bars standard deviation simulations trials parameter values consecutive list digits phonemic similarity shown lists dissimilar letters similar letters alternating similar dissimilar letters similar positions number elements probability item wins time estimated softmax noise term difference estimated simulation trials items items selected prior time affects estimated combinations prior errors values average weighted probability error combination missing probability prior errors corrected effect lists performance parameter values types item modelled varying digits correspond letters words similar items phoneme common dissimilar items items dissimilar familiar familiarity modelled size relative digit span approximately digits seconds models performance shown figs increase longterm component connections brings stability small number errors serial position curves show correct effect phonemic similarity items solvable connectionist model recall ordered lists figure item span full lines show estimation extra markers error bars standard deviation simulations trials parameter values probability correctly recalling list versus list length lists digits unfamiliar items length experimental data digits adapted shown span versus articulation rate calculated curves shown lists familiar unfamiliar words lists familiar words repetitions data recall 
words nonwords shown adapted probability recalling list correctly function list length shows correct sigmoidal relationship item span shows correct approximately linear relationship articulation rate span unfamiliar items familiar items span increases repeated presentations list accordance hebb effect note span slightly short lists long words discussion relation previous work model extension burgess primarily model effects item list familiarity allowing connection weights show plasticity show phonemic similarity effects simultaneously changing phoneme nodes activated recall note context timing signal varies serial position rhythm presentation absolute time effect temporal grouping modelled modifying context representations reflect presence pauses presentation tion recall rates varied decaying inhibition items selection increases locality errors item replaces item item replace item turn item model remaining problems selecting item node form long term representation item taking existing item nodes learning correct order phonemes item extension address problem presented mechanism selecting items modification competitive burgess interaction occurs item layer extra layer winner active context phoneme nodes avoids partial associations context state items similar winner prevent curves basic selection mechanism sufficient store serial order items recover suppression order selected presentation model maps articulatory loop idea selection mechanism corre sponds part speech production articulation system phoneme layer corresponds phonological store predicts context timing signal present phoneme context inputs item layer serve increase span addition phonemic similarity effects position specific temporal grouping effects conclusion proposed simple mechanism storage recall serially ordered lists items distribution errors predicted model estimated mathematically models wide variety experimental data virtue long short term plasticity connection weights model begins address 
familiarity role vocabulary acquisition predicted error probabilities checked experimentally predictions major prediction model burgess addition shortterm store phonological information process ordered lists items involves component timevarying signal reflecting rhythm items presentation acknowledgements grateful discussions graham data george error probabilities mike page suggesting softmax function work supported royal society university research fellowship references psychology working memory press touretzky advances neural information processing systems mateo morgan kaufmann burgess memory language working memory language erlbaum american psychology memory language published tech report applied psychology unit cambridge burgess psychology submitted current research natural language generation london academic press brown memory language nature part neuroscience
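The selection mechanism described above (Gaussian noise added to item activations, winner-take-all choice, then decaying inhibition of the winner) and the softmax estimate of the winning probability can be sketched as follows. This is a simplified illustration under stated assumptions: the inhibition strength, the decay factor, the temperature parameter, and all names are hypothetical, and the context and phoneme layers are collapsed into a precomputed activation per item and time step.

```python
import math
import random

def recall_order(activations_per_step, decay, inhibition_strength, noise_sd, rng):
    """Select one winner per time step, then suppress it with decaying inhibition."""
    n_items = len(activations_per_step[0])
    inhibition = [0.0] * n_items
    order = []
    for acts in activations_per_step:
        noisy = [a - inh + rng.gauss(0.0, noise_sd)
                 for a, inh in zip(acts, inhibition)]
        winner = max(range(n_items), key=lambda i: noisy[i])
        order.append(winner)
        # Existing inhibition decays by a constant factor each step ...
        inhibition = [inh * decay for inh in inhibition]
        # ... and the item just selected receives fresh inhibition (assumed size).
        inhibition[winner] = inhibition_strength
    return order

def softmax_win_probabilities(activations, temperature):
    """Softmax estimate of each item's probability of winning (assumed form)."""
    top = max(activations)
    exps = [math.exp((a - top) / temperature) for a in activations]
    total = sum(exps)
    return [e / total for e in exps]
```

With the noise set to zero the items come out in the order favored by the context signal. Raising the noise level makes items with similar activations swap places, which is the source of order errors between neighbouring items.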
4 information processing create movements david robinson departments biomedical engineering johns hopkins university school medicine baltimore abstract muscles deal external loads write equation relates motoneuron firing rate position velocity situation canals head velocity linear manner high background discharge rate linearity circuits generate movements allowed signal processing involved including neural network integrates ideas summarized block diagrams describing behavior single neurons finding supported neural network models introduction neural networks studies simple differ applications attempt model real neural oculomotor system extensively studied extent neural networks succeed describing behavior hidden units major benefit neural networks oculomotor system illustrate shortcomings block diagram models expect inside boxes conversely single unit behavior loosely coupled system behavior simplicity oculomotor system relationships understood complicated system behavior single hidden units give robinson information system mind simplifications oculomotor control impossible muscles load varies write equation uniquely relates discharge rates motoneurons position load position case limb muscles system firstorder linear differential equation linearity design canals origin vestibuloocular reflex reflex creates movements compensate head movements stabilize eyes space clear vision canals primarily head velocity neurally encoded discharge rates afferents rates modulate high background rate typically cutoff wide linear range core reflex neurons long canals impose properties linear modulation high background rate neurons including motoneurons addition linearity functions oculomotor subsystems clear stretch reflex muscle fibers straight parallel joint features combine understand organization oculomotor signals caudal system modelling neural network modelling distribution oculomotor signals application neural networks oculomotor system study anastasio robinson problem 
addressed concerned convergence diverse oculomotor signals caudal major oculomotor subsystems saccadic system eyes jump rapidly target smooth pursuit system eyes track moving target appears caudal velocity command canals vestibular nuclei provide command compensatory vestibular movements burst neurons nearby reticular formation provide signal desired velocity saccade purkinje cells cerebellum carry signal pursuit movements commands converge region motoneurons records cells region discharge rate high background rate previously coefficients assume values seemingly random neuron robinson model show velocity commands converging suggest existence neurons carrying complicated signals hand behavior nice biological signals exist signals information processing create movements distributed correct amount simple specific distributed parallel processing nervous system neural network model explicit statement distribution initial synaptic weights learning creates hidden units concluded neural network model neural system exercise brought home simple obvious message models conceptual functions realized neurons examined distribution spatial properties interneurons anastasio robinson vertical things simple inputs primary afferents vertical canals sense head rotations combinations pitch roll output layer vertical muscles move vertically model trained perform compensatory movements combinations pitch roll sensitivity axis axis rotation head produces maximum modulation discharge rate sensitivity axis canal unit perpendicular plane canal lies motoneuron axis muscle rotate sensitivity axes hidden units block diagram spatial manipulations consists matrices geometry canals matrix converts vector neurally encoded representation canal geometry muscles matrix converts motoneuron vector physical vector brainstem matrix describes canal neurons project motoneurons robinson scheme interneurons fixed sensitivity axes canal unit motoneuron model sensitivity axes distributed network hidden units 
point variety directions confirmed recordings fukushima spatial aspects transformations temporal aspects distributed interneurons case form matrix find recording single units tells network talk motor physiology coordinate systems transformations question asked coordinate system neuron working individual hidden units behave coordinate system raises problem meaningful question neural integrator muscles largely position actuators constant load position proportional robinson motoneurons muscles signal proportional desired position velocity commands enter caudal commands command obtained integrating velocity signals robinson review location neural network discovered caudal work networks based positive feedback proposed utilizing lateral inhibition recently learning neural network dynamic proposed robinson hidden units freely connected input canal units output motoneurons operate plant transfer function plant time constant create position time integral input head velocity error retinal image slip difference actual ideal velocity trial interval change weights steepest descent method error negligible compensate plant network produce combination output velocity integral position signals weights hidden units remarkably integrator neurons record exercise raises issues model network marked parallel direct velocity feedforward path gain parallel combination pole plant leaving position perfect integral head velocity diagram conceptually disorders robinson hint neurons effect integration useless regard pointed complex context direct feedforward path gain positive feedback path model plant produces transfer function correct feedforward feedback note neural network model integrator feedback feedforward pathways relies positive feedback network block diagrams making questions correct irrelevant thing level organization close neuron level conceptual useless interested describing real neural networks finally test model network proposed neural integrator involves small sets cells talk 
process signals technology answer question real successful examples true neurophysiology solve content describe cell groups acknowledgements research supported grant national institute national health references anastasio robinson distributed representation ocular signals brainstem neurons biol cybern anastasio robinson distributed parallel processing vertical vestibuloocular reflex learning networks compared tensor theory biol cybern robinson learning network model neural integrator oculomotor system biol cybern robinson proposed neural network integrator oculomotor system biol cybern fukushima baker peterson spatial properties secondorder vestibuloocular neurons alert brain model central neural pathways vestibuloocular reflex neurophysiol robinson matrices analyzing threedimensional behavior vestibuloocular reflex biol cybern robinson integrating neurons neurosci robinson signals vestibular nucleus mediating vertical movements monkey neurophysiol robinson clinical applications models thompson topics baltimore williams
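A minimal numerical sketch of the pole-cancellation argument from the neural integrator discussion, in Python. It assumes a first-order plant with time constant `plant_tau` and an ideal integrator; the function name, the step size, the 0.2 s time constant, and the test pulse are illustrative assumptions, not values from the text. The velocity command reaches the plant twice: directly, with gain equal to the plant time constant, and through the integrator. With that parallel combination the plant output should track the time integral of the velocity command.

```python
def simulate_eye_position(velocity, dt=0.001, plant_tau=0.2):
    """Drive a first-order plant with velocity feedforward plus integrated velocity."""
    integral = 0.0  # output of an ideal integrator (desired position)
    eye = 0.0       # plant output (actual position)
    desired, actual = [], []
    for v in velocity:
        # Feedforward path with gain plant_tau, in parallel with the integrator.
        command = plant_tau * v + integral
        # Euler step of the plant equation: plant_tau * dE/dt + E = command.
        eye += dt * (command - eye) / plant_tau
        integral += dt * v
        desired.append(integral)
        actual.append(eye)
    return desired, actual


# Hypothetical test input: a 200 ms velocity pulse followed by 300 ms of rest.
velocity_pulse = [100.0] * 200 + [0.0] * 300
desired, actual = simulate_eye_position(velocity_pulse)
```

Dropping the `plant_tau * v` term from `command` leaves the plant lagging behind the integrator output, which is the sluggish response the feedforward path is there to remove.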
2 discovering structure reactive environment exploration michael mozer computer science institute cognitive science university colorado boulder jonathan department computer information science university massachusetts amherst abstract robot unfamiliar performing actions sensing resulting environmental states robots task internal model environment model predict consequences actions sequences actions reach goal states rivest schapire schapire studied problem designed symbolic algorithm explore infer structure finite state environments heart algorithm representation environment called update graph developed connectionist implementation update graph network architecture back propagation learning exploration strategy choosing random actions network outperform rivest schapire algorithm simple problems network additional strength accommodate stochastic environments greatest virtue suggests generalizations update graph representation arise traditional symbolic perspective introduction robot unfamiliar environment robot allowed environment actions sensing resulting environmental states sufficient exploration robot internal model environment model predict consequences actions determine sequence actions reach goal state paper describe network task based representation finitestate automata developed rivest schapire mozer schapire environments modeled finitestate automaton environment robot discrete actions execute move environmental state environmental state detected robot illustrate concepts methods work extended simple environment world rivest schapire world consists rooms arranged circular chain room connected adjacent rooms room light bulb light switch robot sense light room stands robot actions move room chain move room chain light switch current room modeling environment world sensory consequences sequence actions predicted determine sequence actions obtain goal state developing algorithm learn directly arguments schapire
important capture structure inherent environment learn rivest schapire suggest learning representation environment called update graph advantage update graph environments regularities number nodes update graph smaller versus world rivest formal definition update graph based notion tests performed environment equivalence tests section present alternative intuitive view update graph facilitates connectionist interpretation world model environment essential knowledge required status lights current room room current room room current room update graph node environmental variables assume node indicating light room values variables current environmental state values taking action robot moves room previous previous world previous depicted figure action results shifting values nodes makes sense moving affect status light alter robots position respect rooms figure shows analogous flow information action finally action status current rooms light rooms remain unaffected figure figure sets links figures superimposed labeled action final detail rivest schapire update graph formalism make link avoid node split values representing status room complement figure involves values values shifted actions update graph figure node current environmental state result sequence actions predicted simply shifting values graph predicting inputoutput behavior concerned update graph serves purpose defining current description property update graph node incoming link action call constraint input action figure links nodes indicating desired information flow performing action represents status current room status room status room links nodes indicating desired information flow performing action links nodes indicating desired information flow performing action links separate actions superimposed labeled action link avoided adding nodes represent update graph world mozer rivest schapire algorithm rivest schapire developed symbolic algorithm algorithm
explore environment learn update graph representation algorithm explicit hypotheses regularities environment tests hypotheses small number time result algorithm make full environmental feedback obtained worthwhile alternative approaches efficient feedback efficient learning update graph approach shown promising results preliminary experiments suggests significant benefits detail benefits describe basic approach update graph connectionist network turn update graph connectionist network start assuming unit network node update graph activity level unit represents truth update graph node units serve outputs network world output network unit represents status current room environments case output units analog labeled links update graph labels values link action occurs terms links gated action elaborate include units represent actions units gate flow activity units update graph action performed action unit activated connections gated action enabled action units form local representation active time connections enabled time gated replaced weight matrices action predict consequences action weight matrix action network activity allowed propagate connections network dynamically current action effect activity propagation activity unit previous activity unit linear activation function sufficient achieve action selected time weight matrix action activity vector results taking action assuming weight matrices connection strength constraint activation rule activity values copied network training network update graph connectionist network behave update graph turn procedure learn connection strengths network expository purposes assume number units update graph advance show mozer weight matrix required action potential nonzero connection pair units connectionist learning procedures weight matrices initialized values outcome learning matrices represent update graph connectivity network behave update graph constraint satisfied terms
connectivity matrices means weight matrix connection strengths achieve property additional constraints weights combination constraints connection strength action constraint satisfied introducing additional cost term error function constraints enforced weight update normalization procedure finds shortest distance projection updated weight vector satisfies constraint time step training procedure consists sequence events action selected random weight matrix action compute activities previous activities selected action performed environment resulting observed observed compared predicted network activities units chosen represent compute measure error error added contribution back propagation procedure rumelhart hinton williams compute derivative error respect weights current earlier time steps weight matrices action updated error gradient enforce constraints temporal record unit activities maintained permit back propagation time updated reflect weights explanation activities output units time represent predicted replaced actual observed steps require error measured time correct propagation activities time call modification weight matrix error attributed incorrect propagation earlier times back propagation assign weights earlier times critical parameter training amount temporal history found problem error propagation mozer rain critical number steps improve learning performance fewer performance results problem appeared safe limit number nodes date graph solution problem back propagate error time maintain temporal record unit activities problem arises activities weight update activities longer consistent weights equation violated error derivatives computed back propagation exact equation satisfied future weight updates based inconsistent activities correct empirically found algorithm extremely unstable address problem situations back propagation applied sequences sequences finite length wait sequence update weights point consistency activities weights longer matters 
system starts beginning sequence present situation sequence actions terminate forced alternative means ensuring consistency successful approach involved updating activities weight change force consistency step list propagated earliest activities temporal record forward time updated weight matrices results figure shows weights update graph network world robot steps figure depicts connectivity pattern identical update graph figure explain correspondence shape person head left arms left legs heart action head output unit receives input left left heart heart head forming loop units left form figure weights exploratory steps world large diagram represents weights actions small diagram contained large diagram represents strengths feeding unit action units small diagrams output unit state current room head large white square position small represents strength unit position large unit represented small area square connection similar loop action loops present reverse direction loops figure action left arms heart left current head change values corresponds exchange values nodes figure addition learning update graph connectivity network simultaneously learned correct activity values node current state environment information network predict outcome sequence actions prediction error drops causing learning network completely stable news network converge random initial weights requires order steps weight constraints removed network converges fail steps mozer weight constraints suggest weight constraints resulting weight collection positive negative weights varying magnitudes readily interpreted case world reason final weights difficult interpret discovered solution satisfy update graph formalism discovered notion links sort shown figure links units required unnecessary units solution encode information table compares performance algorithm network weight constraints environments performance measured terms median number
actions robot predict outcome subsequent actions details experiments found mozer simple environments update graph outperform algorithm result surprising action sequence train network generated random contrast algorithm involves strategy exploring environment conjecture network considers updates hypotheses parallel time step complex environments network poorly complex number nodes update graph large number distinguishing environmental small network failed learn world algorithm succeeded intelligent exploration strategy case random long search state space direction future work potential offered connectionist learning algorithms connectionist approach benefits table number steps required learn update graph connectionist environment algorithm update graph world radio world world world fails mozer performance network appears insensitive prior knowledge number nodes update graph learned contrast algorithm requires upper bound update graph complexity performance degrades significantly upper bound tight network accommodate noisy environments contrast algorithm learning network continually makes predictions result action predictions improve experience algorithm make predictions learning complete modified cost treating update graph matrices connection strengths suggested update graph formalism dont arise traditional analysis fairly direct extension allowing links connectionist network linear system linear weight matrices produce equivalent system local connectivity update graph mozer linearity network tools linear algebra analyze resulting connectivity matrices benefits approach problem study claim approach impressive work rivest schapire offers strengths alternative learning problem acknowledgements schapire paul smolensky rich sutton helpful discussions work supported grant james mcdonnell foundation michael mozer grant sloan foundation geoffrey hinton grant force office scientific research andrew barto references mozer discovering structure reactive environment exploration
technical report boulder university colorado computer science rivest schapire inference finite automata proceedings annual foundations computer science rivest schapire approach unsupervised learning deterministic environments proceedings fourth international workshop machine learning rumelhart hinton williams learning internal representations error propagation rumelhart mcclelland parallel distributed processing explorations microstructure cognition foundations cambridge books schapire inference automata unpublished masters thesis massachusetts institute technology cambridge
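A small Python sketch of the update graph idea for the circular room world, offered as an illustration rather than as the authors' implementation. It assumes one node per room (light status at a given offset from the robot) plus one complement node per room, and it labels the three actions `U`, `D`, and `T`; these labels, the node ordering, and all function names are hypothetical. Each node copies its value from exactly one source node per action, which is the one-incoming-link constraint described in the text, so prediction reduces to shifting values around the graph.

```python
import random


def build_update_graph(n_rooms):
    """For each action, list the source node that every node copies from."""
    # Node i     : light status of the room i steps ahead of the robot.
    # Node n + i : complement of node i.
    n = n_rooms
    up, down, toggle = [], [], []
    for node in range(2 * n):
        base, i = (node // n) * n, node % n
        up.append(base + (i + 1) % n)    # after moving up, offset i holds old offset i + 1
        down.append(base + (i - 1) % n)  # after moving down, offset i holds old offset i - 1
        toggle.append(node)              # rooms other than the current one are unaffected
    toggle[0], toggle[n] = n, 0          # current room exchanges values with its complement
    return {"U": up, "D": down, "T": toggle}


def predict_with_graph(lights, actions):
    """Predict the sensed light after each action by shifting node values."""
    graph = build_update_graph(len(lights))
    values = list(lights) + [1 - v for v in lights]  # robot starts in room 0
    predicted = []
    for action in actions:
        values = [values[src] for src in graph[action]]
        predicted.append(values[0])  # node 0 is the status of the current room
    return predicted


def simulate_rooms(lights, actions):
    """Ground truth: run the same actions in the room world itself."""
    lights, pos, sensed = list(lights), 0, []
    for action in actions:
        if action == "U":
            pos = (pos + 1) % len(lights)
        elif action == "D":
            pos = (pos - 1) % len(lights)
        else:
            lights[pos] = 1 - lights[pos]
        sensed.append(lights[pos])
    return sensed
```

Writing each action's source list as a 0/1 matrix gives the per-action weight matrices that the connectionist version is trained to discover; the list form is used here only to keep the sketch free of dependencies.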
6 connectionist model owls sound localization system daniel rosen department psychology stanford university stanford david rumelhart department psychology stanford university stanford eric knudsen department neurobiology stanford university stanford abstract good understanding theoretical principles learning realized neural systems address problem built computational model development owls sound localization system structure model drawn experimental data learning principles recent work field brain style computation model accounts numerous properties owls sound localization system makes specific testable predictions future experiments theory developmental process introduction barn remarkable ability localize sounds space complete precision depends skill survival guide current address center neuroscience francisco search knudsen blasdel konishi central owls localization system precise auditory maps space found owls optic tectum external nucleus inferior colliculus development sensory maps poses difficult problem nervous system accuracy depends changing relationships animal environment encodes information location sound source phase amplitude differences sound reaches owls ears differences change dramatically animal head grows genome advance precisely animals head develop environmental factors affect process encode precise development auditory system genome design auditory system adapt environment letting learn precise interpretation auditory cues head ears order understand nature developmental process built connectionist model owls sound localization system theoretical principles learning knowledge neurophysiology essential system modeled calculates horizontal component sound source location interaural time difference sound reaches ears knudsen konishi computes vertical component signal determining interaural level difference sound knudsen konishi animal processes signals numerous subcortical nuclei form ordered
auditory maps space optic tectum figure shows diagram neural circuit neurons optic tectum spatially tuned auditory stimuli cells nuclei respond sound signals originating restricted region space relation knudsen neurons respond exclusively auditory signals cells optic tectum hand encode visual sensory maps drive motor system location auditory visual signal researchers study owls development systematically altering animals sensory experience ways animal sound altering auditory experience altering visual experience disturbance auditory visual cues period neural behavioral bring auditory space back alignment visual andor tune auditory sensitive range binaural sound signals induced place level vlvp computed knudsen visually induced adjustment auditory maps space place level knudsen ability adjust altered sensory signals time greatly restricted knudsen knudsen overview barn owls sound localization system optic tectum space shell figure describing flow auditory information owls sound localization system simplicity connections leading optic shown nuclei labeled included model nuclei process andor information labeled network model model major components network architecture based neurobiology owls localization system shown figure learning rule derived computational learning theory elements model standard connectionist units output activations sigmoidal functions weighted inputs learning rule train model standard section describe derived rule defining goal network goal network accurately sound signals sound source locations network discover model world captures relationship sound signals sound source locations recent work connectionist learning theory shown ways design networks search model fits data hand buntine weigend mackay rumelhart durbin golden chauvin press section apply analysis localization network table table showing mathematical terms analysis term meaning model data probability model
data training pairs input vector training trial target vector training trial output vector training trial output unit training trial weight unit unit unit activation function unit evaluated term maximized network deriving function maximized network maximize probability model data bayes rule write probability represents model units weights biases represents data define data ordered pairs sound signal represent cues targets train network owls case cues auditory signals target information provided visual system table lists mathematical terms section simplify equation taking natural logarithm side giving natural logarithm monotonic transformation network maximizes equation maximize final term equation represents probability ordered pairs network observes model network settles term remains data constant training ignore choosing model term equation represents probability model prior term bayesian analysis estimation model true data discuss concentrate maximizing rosen rumelhart knudsen assumptions networks environment assume training data pairs auditory visual signals independent rewrite previous term subscript denotes data training pair expand term ignore term sound signals dependent model left task maximizing important note represents visual signal localization decision network attempts predict visual experience auditory experience predict probability making accurate localization decision assume visual signals provide target values network analysis shows auditory follow visual leads accurate localization behavior assumption supported experiments showing vision guide formation auditory spatial maps knudsen knudsen knudsen clarify relationship inputs targets real world probabilistic input exists distribution target values estimate shape distribution case assume target values distributed sound signal visual system detect sound source point space made assumption clarify interpretation network output array element vector represents activity output unit training trial assume 
output activation units represents expected target case expected binomial distribution output unit represents probability sound signal originated location write probability data model taking natural probability summing data pairs term maximize standard crossentropy term deriving learning rule defined goal derive learning rule achieving goal input unit determine rule compute connectionist model owls sound localization system equations dropped subscript denotes training trial analysis identical trials write derivative units activation function evaluated input choose activation function output units logistic good choice reasons bounded makes sense assume probability sound signal originated point space bounded compute derivative logistic function result term variance binomial distribution return derivative cost function variance term denominator final derivative compute weight output units weights units network updated standard backpropagation learning algorithm model priors types priors model architectural design fixed network architecture previous section based knowledge nuclei involved owls localization system equivalent setting prior probability architecture weight elimination prior similar priors interpreted ways reduce complexity network weigend huberman rumelhart network maximizes expression function error complexity training train model presenting input core inferior colliculus encodes interaural phase time differences angular nuclei encode sound level outputs network compared target values presumed visual system weights adjusted order minimize difference mimic training varying average difference angular input values mimic training systematically changing target values input rosen rumelhart knudsen figure activity level units response auditory immediately simulated training begun left training middle training completed results discussion trained network accurately shows auditory tuning curves modeled nuclei responds appropriately manipulations mimic experiments 
blocking inhibition level network shows responses changing average binaural intensity level vlvp lateral shell network exhibits properties found developing model appropriately adjusts auditory localization behavior simulated experiments plasticity takes place level vlvp simulations begun progressively training networks ability adapt training gradually time plasticity qualitatively similar sensitive critical periods network adapts appropriately simulated studies response simulations primarily place lateral shell connections studies networks ability adapt time unlike mature highly trained network retains ability adapt simulated experiment discovered derived learning rule models stages adjustment standard backpropagation network knudsen report observing peaks activity response auditory stimulus training response newly learned response time response newly learned grows shown figure network exhibits pattern learning networks trained standard backpropagation learning algorithm result support idea owls localization system computing function similar network designed learn addition accounting data network predicts results experiments designed mimic specifically network accurately predicted removal animals facial vary azimuth elevation effect animals response varying network goals designed accounts developmental data makes testable predictions future experiments derived learning rule principled fashion network specific theory owls sound localization system references knudsen dynamics visual calibration interaural time difference barn owls optic tectum society neuroscience abstracts knudsen plasticity inferior colliculus site visual calibration neural representation auditory space barn journal neuroscience buntine weigend bayesian backpropagation systems knudsen auditory properties units owls optic tectum journal neurophysiology knudsen early results degraded auditory space optic tectum barn proceedings national academy
science knudsen blasdel konishi sound localization barn measured search coil technique journal physiology knudsen knudsen vision adjustment auditory localization young barn owls science knudsen knudsen sensitive critical periods visual calibration sound localization barn owls journal neuroscience mackay bayesian methods adaptive models unpublished doctoral dissertation california institute technology pasadena california knudsen adaptive adjustment unit tuning sound localization cues response monaural occlusion developing optic tectum journal neuroscience acoustic location barn owls journal biology rumelhart durbin golden chauvin press backpropagation theory chauvin rumelhart backpropagation theory architectures applications hillsdale lawrence associates weigend huberman rumelhart predicting future connectionist approach international journal neural systems
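The learning-rule derivation for the output units (binomial targets, logistic activation, cross-entropy term) can be checked numerically. The Python sketch below is an illustration under stated assumptions: the function names are hypothetical, and it covers a single output unit on a single training trial. The slope of the logistic function, y(1 - y), equals the binomial variance that appears in the denominator of the cost derivative, so the two cancel and the gradient with respect to the unit's net input reduces to target minus output.

```python
import math


def logistic(net):
    """Logistic activation, bounded between 0 and 1 like a probability."""
    return 1.0 / (1.0 + math.exp(-net))


def log_likelihood(target, net):
    """Cross-entropy term: ln P(target | output) for one binomial output unit."""
    y = logistic(net)
    return target * math.log(y) + (1.0 - target) * math.log(1.0 - y)


def gradient_wrt_net(target, net):
    """(t - y) / (y (1 - y)) times the logistic slope y (1 - y) leaves t - y."""
    return target - logistic(net)


def numeric_gradient(target, net, eps=1e-5):
    """Central finite difference, used only to check the analytic result."""
    upper = log_likelihood(target, net + eps)
    lower = log_likelihood(target, net - eps)
    return (upper - lower) / (2.0 * eps)
```

Comparing `gradient_wrt_net` with `numeric_gradient` over a handful of targets and net inputs is a quick way to confirm the cancellation before the term is passed back through the rest of the network by standard backpropagation.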
1 automatic local annealing psychology carnegiemellon university pittsburgh abstract research involves method finding global maxima constraint networks process unlike annealing schedule temperature determined locally units update processing unit level major practical benefits processing processing continue areas network good areas remain stable processing continues areas long constraints remain poorly satisfied stop predetermined number cycles result method avoids requiring externally determined annealing schedule finds global maxima quickly consistently externally scheduled systems comparison boltzmann machine ackley made finally implementation method computationally trivial introduction constraint satisfaction network network units represent constraints represented directional connections units positive connection weight suggests hypothesis accepted rejected negative connection weight suggests hypothesis accepted rejected relative importance satisfying constraint absolute size weight acceptance rejection hypothesis activation unit point activation space corresponds solution constraint problem represented network quality solution calculated summing constraints goal find point activation space quality maximum automatic local annealing units update move state satisfies means avoiding local quality maxima activation space simply fundamental problem gradient procedures annealing systems attempt avoid problem giving units probability moving satisfies constraints probability called temperature network high solutions generally good network moves easily activation space temperature network area activation space good improving solution area annealing analogy notion start high lower slowly network gradually replace state state improvement ability guide globally maximal state atoms slowly annealed find optimal structures search solutions requires means determining temperature network annealing systems simply predetermined schedule provide information practical problems 
approach main practical problems annealing schedule processing quality current solution temperature uniform network parts network merit temperatures case time part network area activation space natural condition theoretical problem approach involves selection annealing schedules order pick schedule network knowledge good solution network order system find solution solution find problem critical elements process temperature decreased handled network quality final solution depend part systems understanding problem allowing unit control temperature processing automatic local annealing avoids addition resolving main practical problems ends finding global maxima quickly reliably externally controlled systems mechanics units continuous activations uniform minimum maximum uniform resting activation units minimum maximum units start random activations updated synchronously cycle ways updated ordinary update rule positive input defined increases activation negative input decreases activation simply reset resting activation update probability function determines probability normal update unit based temperature defined noted input unit calculated trivial quantity rest equation goodness goodness goodness largest goodness largest goodness constants calculated unit beginning simulation depend weights unit constant maximum minimum resting activation values temperature representing high temperature simulations processing networks tested network processed figure local maxima extremely close global maxima difficult network sense search global maximum extremely sensitive minute difference global maxima local maxima network processed figure local maxima close global maxima easy network sense slow process parameters improved performance network order illustrate relative generality algorithm parameters maximum activation minimum activation resting activation normal update rule activation activation automatic local annealing function defines process moves slowly global maximum moves good 
solutions easily units results results running automatic local annealing process networks comparison standard boltzmann machines summarized figures automatic local annealing probability found stable global maximum fairly processing begins increases smoothly boltzmann machine makes annealing schedule quickly moves solution global maximum order reliability boltzmann machines schedule slow solutions found slowly conversely order start finding solution quickly short schedule reliability worse finally makes reasonable comparison boltzmann machine changing parameters process maximize performance network single annealing schedule boltzmann machine networks performance advantage increases substantially discussion works characteristics approach global maximum determined shape update probability function modifying shape control things network moves global maximum easily moves local maxima good solution order completely stable critical feature function temperature decreases probability normal update increases unit progresses extreme activation unit resting activation figure difficult network global maxima upper units remaining units lower units remaining units local maxima upper left lower units remaining units upper lower left units remaining units figure easy network cube network mcclelland rumelhart units connected shown connections sets clarity global maxima units cube units automatic local annealing automatic cycle cycle cycle cycles processing figure difficult network figure cycle cycle schedule cycle processing figure easy network figure line based trials stable global maxima network remained rest trial annealing schedules performing schedules found units effect movement activation space contribute units cold units compete control critical movement cold units connected units agreement heat connected units disagreement temperature equation connected units begin connected units spreads stabilizing sets units hypotheses agree spreading makes algorithm work units decision 
units connect case units accordance global criterion quality states networks order global maxima found network general amount time spent proportional amount heat state heat directly related stability heat network stable represent global maxima total constraints infinite processing time commonly visited states global maxima importantly state proportional quality mathematical description developed characteristic good practical benefits employs notion solution update probability function units normal update probabilities temperatures higher simulations condition states completely stable perfectly satisfying constraints time simulation increases probability state approaches approaches proportional quality states good frozen decrease time amount time directly related point times small points large points achieved type tradeoff extremely practical applications measuring performance finds global maxima faster reliably boltzmann machine annealing benefits processing number elements make preferable externally scheduled annealing processes solutions problems found temporarily maintained processing considers constraint satisfaction terms schema processors corresponds nicely simultaneous processing levels schemas obvious solutions filled quickly higher level schemas found real solutions initial part final solution appearance automatic local annealing processing settings biologically feasible externally scheduled systems units function intelligent processor paths traversed activation space schema parallel human closely processing lend simple learning algorithms processing units acting close accord constraints present distant favor units rarely constraints network basic approaches making weight adjustments continuously increasing weights units agreement decreasing weights units disagreement hypotheses minsky power area current research represent enormous time savings boltzmann machine type learning ackley found feasible references ackley hinton sejnowski leaming algorithm 
boltzmann machines cognitive science mcclelland rumelhart explorations parallel distributed processing cambridge press minsky papert perceptrons cambridge press
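The update scheme described in the mechanics section of the automatic local annealing paper (each unit derives its own temperature from how well its constraints are satisfied, then either takes a normal update or is reset to its resting activation) can be sketched as follows. The equations themselves did not survive extraction, so every functional form here is an assumption made for illustration: the temperature `1 - goodness / g_max`, the linear update probability with a hypothetical floor `p_min`, and the interactive-activation style normal update. The names `max_goodness`, `step`, `rate` and `p_min` are likewise hypothetical.

```python
import random


def max_goodness(weights, i, lo=-1.0, hi=1.0):
    """Largest goodness unit i could reach; depends only on its weights and the bounds."""
    a_max = max(abs(lo), abs(hi))
    return a_max * a_max * sum(abs(w) for j, w in enumerate(weights[i]) if j != i)


def step(acts, weights, rest=0.0, lo=-1.0, hi=1.0, rate=0.2, p_min=0.1, rng=random):
    """One synchronous cycle: every unit is updated from the previous activations."""
    n = len(acts)
    new_acts = []
    for i in range(n):
        net = sum(weights[i][j] * acts[j] for j in range(n) if j != i)
        goodness = acts[i] * net
        g_max = max_goodness(weights, i, lo, hi)
        # Assumed local temperature: 0 when fully satisfied, 1 when not satisfied at all.
        temp = 1.0 if g_max == 0 else min(1.0, max(0.0, 1.0 - goodness / g_max))
        # Assumed update probability: rises to 1 as the temperature falls to 0.
        p_normal = 1.0 - (1.0 - p_min) * temp
        if rng.random() < p_normal:
            # Normal update: positive input raises activation, negative input lowers it.
            if net > 0:
                a = acts[i] + rate * net * (hi - acts[i])
            else:
                a = acts[i] + rate * net * (acts[i] - lo)
            new_acts.append(min(hi, max(lo, a)))
        else:
            # Otherwise the unit is simply reset to its resting activation.
            new_acts.append(rest)
    return new_acts
```

Under these assumptions a perfectly satisfied state has temperature zero everywhere and is therefore stable, while poorly satisfied regions keep being reset and re-explored, which is the qualitative behaviour the paper describes.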
11 learning spirals relations chen wang department computer information science center cognitive science ohio state university abstract benchmark task spiral problem neural networks unlike previous work emphasizes learning approach problem generic perspective involve learning point spiral problem intrinsically connected problem generic solution problems proposed based oscillatory correlation time delay network simulation results qualitatively consistent human performance interpret human limitations terms synchrony time delays biologically plausible special case network time delays distinguish figures shape position size orientation introduction spiral problem refers distinguishing connected single spiral disconnected double spirals illustrated minsky papert introduced problem book perceptrons received attention benchmark task neural networks solutions attempted learning models lang witbrock reported problem solved standard multilayer perceptron resulting learning systems produce decision regions highly constrained spirals training specific shape position size orientation explanation provided problem difficult human subjects solve grossberg proposed biologically plausible neural network architecture figureground separation reported network distinguish connected disconnected spirals paper demonstration spiral problem model exhibit limitations humans national laboratory machine perception center information science university email learning related problem study visual perception perception relations visual input single closed curve task relation determine specific pixel lies inside closed curve human visual system perception relations appears perception humans bounding contour highly ullman ullman suggested computation spatial relation visual routines visual routines result conjecture inherently sequential pointed recently ullman processes underlying perception relations unknown applying visual routines simply alternative spiral problem connected single spiral disconnected 
double spirals adapted minsky papert relations example adapted adapted ullman theoretical investigations brain functions timing neuronal activity construction neuronal assemblies malsburg particular discovery synchronous oscillations visual cortex singer gray triggered interest develop computational models oscillatory correlation recently wang proposed locally excitatory globally inhibitory networks legion theoretically showed legion rapidly achieve synchronization locally coupled oscillator group representing object number oscillator groups representing objects recently campbell wang studied time delays networks relaxation oscillators analyzed behavior legion time delays studies show loosely synchronous solutions achieved broad range initial conditions time delays legion computational framework study process visual perception standpoint oscillatory correlation explore spiral problem relations oscillatory correlation paper show computation legion time delays yields generic solution problems time delays occur information transmission biological system investigation perceptual performance limited local activation rapidly propagated time delays special case legion time delays reliably distinguishes connected disconnected spirals inside shape position size orientation suggest kind problems solved neural oscillator network sophisticated learning methodology architecture legion paper twodimensional network connected nearest neighbors global receives excitation oscillator network turn inhibits oscillator chen wang wang legion single oscillator defined represents external stimulation oscillator represents coupling oscillators network symbol denotes amplitude gaussian noise parameters chosen control periodic solution dynamic system periodic solution alternates silent active phases steadystate behavior wang coupling term time parameter controls steepness sigmoid function synaptic weight oscillator oscillator neighbors time delay interactions campbell wang threshold 
oscillator affect neighbors positive weight inhibition global activity defined oscillator oscillator represents threshold determine sends tion oscillators parameter determines rate stimulation oscillators pattern formation refer behavior oscillators representing object synchronous oscillators representing objects wang analytically shown solution achieved legion time delays solution achieved time delays introduced synchrony concept introduced describe time delay behavior campbell wang pattern tion entire network synchrony achieved synchrony local concept defined terms pairs neighboring oscillators intro duce measure called difference order examine pattern formation achieved suppose oscillators represent pixels object oscillator represents pixel object denote time oscillator enters active phase difference measure defined time period active phase intuitively measure suggests pattern formation achieved oscillators representing pixels object overlap active phase oscillators representing pixels belonging objects stay active phase simultaneously definition pattern formation applies exact synchrony legion time delays synchrony time delays simulations image consisting pixels twodimensional legion network oscillators oscillator network corresponds pixel image simulations equations numerically solved method illustrate stimulated oscillators black squares oscillators initialized randomly large number simulations learning conducted broad range parameter values network sizes chen wang report typical results specific parameter values spiral problem simulations images sampled binary images pixels images problems addressed image presented determine single spiral double spirals point twodimensional plane determine inside specific spiral results legion time delay period oscillation spiral problem parameter values simulation external input stimulated oscillators applied legion time delays single spiral image illustrates visual stimulus black pixels correspond stimulated oscillators 
white correspond oscillators shows sequence snapshots network stabilized snapshot shows random initial state network snapshots arranged temporal order left bottom observe snapshots activated oscillator spiral propagates activation neighbors time delay process propagation forms traveling wave spiral emphasize time oscillators portion spiral stay active phase entire spiral active phase simultaneously based oscillatory correlation theory system group spiral system fails realize pixels spiral belong pattern note part background behaves similarly shows temporal trajectories combined activities oscillators representing spiral background temporal activity difference measure shows pattern formation achieved order illustrate effects time delays applied legion time delays image simulation results show pattern formation achieved single spiral segregated background period chen wang legion time delays readily solve spiral problem case failure group spiral caused time delays coupling neighboring oscillators applied legion time delays double spirals image shows visual stimulus shows sequence snapshots arranged order observe snapshots starting spiral traveling wave formed spiral activated oscillators representing spiral propagate activation time delays oscillators portion spiral stay active phase entire chen wang spiral active phase simultaneously oscillators representing spiral behavior results show pixels double spirals grouped pattern mention behavior system part background similar double spirals evident pattern formation achieved network stabilized applied legion time delays double spirals image purpose simulation results show spirals segregated spiral background period chen wang failure group double spirals results time delays results legion time delays spiral problem parameter values listed represent disconnected spirals denote background global spiral problem pattern formation means solutions problems tion provided questions counting number objects identifying pixels belong 
spiral solutions pattern formation achieved system solve spiral problem general special condition time delay system solve prob relations simulations pictures sampled binary images pixels applied legion time delays images figures show visual stimuli black pixels represent areas respond stimulated oscillators white pixels represent boundary corresponds oscillators figures illustrate sequence snapshots networks stabilized snapshot shows random initial states networks figures show temporal trajectories combined activities oscillators representing areas results legion time delay parameter values simulation parameter values learning listed denote areas global results legion time delay parameter values statements listed observe activation oscillator rapidly propagate neighbors oscillators representing area eventually tors representing area stay active phase simultaneously generally enter active phase times time delays basis oscillatory correlation system group entire area recognize pixels area elements area difference measure shows pattern formation achieved period contrast observe activated oscillator rapidly propagates activation open regions shown snapshots prop agation limited traveling wave spreads regions shown earlier snapshots result time oscillators portion area stay active phase oscillators representing area active phase simultaneously basis oscillatory correlation system group area fails identify pixels area belonging pattern difference measure shows pattern formation achieved network stabilized order illustrate effects time delays show oscillator network perceive relations applied legion time delays images simulations show legion time delays readily areas cases period chen wang failure group area attributed time delays coupling neighboring oscillators general simulations suggest oscillatory correlation relations neural network pattern formation achieved single area areas image specific point twodimensional plane relations identified examining oscillator 
representing point oscillators representing specific area discussion conclusion reported neural network models solve spiral problem learning solutions subject limitations generalization ities resulting learning systems highly depend training pointed minsky papert solving spiral problem equivalent detecting connect showed computed perceptrons minsky papert limitation holds multilayer perceptrons learning scheme minsky papert chen wang people discussed generality solutions contrast simulations shown legion time delays distinguish figures shape position size orientation emphasize learning involved terms performance suggest spiral problem solved network oscillators learning system alternative perceive relations neural computation perspective method significantly distinguished visual routines ullman visual routine method serial algorithms system inherently parallel distributed process emergent havior reflects degree serial nature problems visual routine method make qualitative distinction rapid perception corresponds simple boundaries slow perception corresponds bound time visual routine method takes varies continuously contrast system makes distinction perception simple boundaries corresponds pattern formation achieved perception boundaries corresponds pattern formation achieved importantly conceptually system highlevel serial process solve problems relations solution involves mecha parallel image segmentation wang acknowledgments authors grateful campbell discussions work supported part grant grant young investigator award references campbell wang relaxation oscillators time delay coupling physica chen wang learning spirals relations technical report ohio state university grossberg neural network architecture figureground separation nected figures neural networks perception press lang witbrock learning spirals proceeding connectionist models summer school morgan kaufmann model visual shape recognition psychological review minsky papert perceptrons press minsky papert 
perceptrons extended version press singer gray visual feature integration temporal correlation hypothesis annual review neuroscience wang global competition local cooperation network neural oscillators physica ullman visual routines cognition ullman highlevel vision press malsburg correlation theory brain function internal report biophysical chemistry wang image segmentation based oscillatory correlation neural computation
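The single-oscillator building block of legion, whose periodic solution alternates between a silent and an active phase when the oscillator is stimulated, can be illustrated with a small simulation. The paper's own equations were lost in extraction, so this sketch assumes the relaxation-oscillator form commonly used in the legion literature, dx/dt = 3x - x^3 + 2 - y + I and dy/dt = eps * (gamma * (1 + tanh(x / beta)) - y), with the coupling and noise terms omitted. The parameter values, the forward-Euler integrator, and the names `simulate_oscillator` and `count_active_phases` are assumptions chosen for illustration only.

```python
import math


def simulate_oscillator(stim, t_end=400.0, dt=0.01, eps=0.02, gamma=6.0,
                        beta=0.1, x0=-2.0, y0=1.0):
    """Integrate one uncoupled, noise-free relaxation oscillator with forward Euler.

    stim is the external stimulation I; returns the trajectory of x.
    """
    x, y = x0, y0
    xs = []
    for _ in range(int(t_end / dt)):
        dx = 3.0 * x - x ** 3 + 2.0 - y + stim
        dy = eps * (gamma * (1.0 + math.tanh(x / beta)) - y)
        x += dt * dx
        y += dt * dy
        xs.append(x)
    return xs


def count_active_phases(xs, threshold=0.0):
    """Count upward crossings of the threshold, i.e. entries into the active phase."""
    return sum(1 for a, b in zip(xs, xs[1:]) if a <= threshold < b)


if __name__ == "__main__":
    print("stimulated:", count_active_phases(simulate_oscillator(0.2)))
    print("unstimulated:", count_active_phases(simulate_oscillator(-0.5)))
```

With positive stimulation the oscillator repeatedly jumps between the silent branch (x near -2 to -1) and the active branch (x near 1 to 2); with negative stimulation it settles on the silent branch. The difference measure in the paper is defined on the times at which coupled oscillators make these jumps, which this uncoupled sketch does not model.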
12 nonnegative boltzmann machine hopfield group building princeton university princeton david mackay laboratory road cambridge united daniel bell laboratories technologies mountain murray hill labs abstract nonnegative boltzmann machine nnbm recurrent neural network model describe multimodal nonnegative data application maximum likelihood estimation model learning rule analogous binary boltzmann machine examine utility field approximation nnbm describe monte carlo sampling techniques learn parameters reflective slice sampling wellsuited distribution efficiently implemented sample distribution illustrate learning nnbm invariant distribution generative model images human faces introduction multivariate gaussian elementary distribution model generic represents maximum entropy distribution constraint covariance matrix distribution match data case binary data maximum entropy distribution matches order statistics data boltzmann machine probability state boltzmann machine exponential form interpreting neural network parameters represent symmetric recurrent weights units network represent local biases parameters simply related observed covariance nonnegative boltzmann machine figure probability density shaded contour plot dimensional competitive nnbm distribution energy function distribution saddle point local minima generates observed multimodal distribution data normal gaussian adapted iterative learning rule involves difficult sampling binary distribution boltzmann machine generalized continuous nonnegative variables case maximum entropy distribution nonnegative data order statistics distribution previously called rectified gaussian distribution energy function normalization constant properties nonnegative boltzmann machine nnbm distribution differ substantially normal gaussian presence nonnegativity constraints distribution multiple modes shows twodimensional nnbm distribution separate maxima located axes multimodal distribution poorly modelled single normal gaussian discuss multimodal nnbm 
distribution learned nonnegative data show limitations field approximations distribution illustrate recent developments efficient sampling techniques continuous belief networks tune weights network specific examples learning demonstrated invariant distribution generative model face images maximum likelihood learning rule nnbm derived maximizing likelihood observed data nonnegative vectors mackay indexes examples likelihood taking derivatives respect parameters subscript denotes clamped average data subscript denotes free average nnbm distribution derivatives define gradient ascent learning rule nnbm similar binary boltzmann machine contrast clamped free covariance matrix update difference clamped free means update local biases field approximation major difficulty learning algorithm lies evaluating averages analytically intractable calculate free averages approximations learning field approximations previously proposed deterministic alternative learning binary boltzmann machine views validity investigate utility field theory approximating nnbm distribution field equations derived approximating nnbm distribution factorized form marginal densities characterized means fixed constant product distributions natural distribution negative random variables optimal field parameters determined minimizing kullbackleibler divergence nnbm distribution factorized distribution finding minimum setting derivatives respect field parameters simple field equations nonnegative boltzmann machine figure slice sampling dimension current sample point height randomly chosen defines slice chosen multidimensional slice point chosen ballistic dynamics reflections interior boundaries slice equations solved free statistics nnbm replaced statistics factorized distribution fidelity approximation determined factorized distribution models nnbm distribution distributions shown field approximation true nnbm distribution suggests naive field approximation learning nnbm fact attempts approximation fail 
learn examples sections field approximation initialize parameters reasonable values sampling techniques montecarlo sampling direct approach calculating free averages numerically accomplished monte carlo sampling generate representative points sufficiently approximate statistics continuous distribution markov chain montecarlo methods employ iterative stochastic dynamics equilibrium distribution converges desired distribution binary boltzmann machine sampling dynamics involves random spin flips change single binary component single component dynamics easily local energy minima converge slowly large systems makes sampling binary distribution difficult computational techniques simulated annealing cluster updates developed circumvent problem nnbm continuous variables makes investigate stochastic dynamics order efficiently sample distribution experimented gibbs sampling ordered found required inversion error function computationally expensive recently developed method slice sampling wellsuited implementation nnbm basic idea slice sampling algorithm shown sample point random uniformly chosen slice defined connected points point chosen mackay figure contours twodimensional competitive nnbm distribution field approximation reflected slice samples randomly slice distribution large shown converge desired density nnbm solving boundary points direction slice simple involves solving roots quadratic equation order efficiently choose point slice ballistic dynamics random initial velocity chosen point evolved travelling distance current point reflecting boundaries slice intuitively reflections dynamics satisfy detailed balance field approximation slice sampling twodimensional competitive nnbm distribution poor field approximation apparent factorized density points slice sampling algorithm representative nnbm distribution higher dimensional data field approximation progressively worse implement numerical slice sampling algorithm order accurately approximate nnbm distribution invariant 
model proposed model orientation tuning primary visual cortex interpreted cooperative nnbm distribution absence visual input firing rates cortical neurons minimizing energy function parameters distribution test nnbm learning algorithm large dimensional nonnegative training vectors generated sampling distribution samples training data parameters learned unimodal initialization evolving training vectors slice sampling evolved vectors calculate free averages estimates updated procedure iterated evolved averages matched training data learned parameters found match original form representative samples learned nnbm distribution shown nonnegative boltzmann machine figure representative samples nnbm training learn translationally invariant cooperative distribution figure face image successive sampling learned nnbm distribution samples generated normal gaussian generative model faces nnbm learn generative model images human faces nnbm model correlations coefficients nonnegative matrix face images reduces dimensionality nonnegative data decomposing face images parts eyes ears parts reconstructing face activations parts significant correlations captured generative model briefly demonstrate nnbm learn correlations sampling nnbm stochastically generates coefficients graphically displayed face images shows representative face images slice sampling dynamics evolves coefficients displayed figure analogous images generated normal gaussian model correlations clear constraints multimodal nature nnbm results samples distinct faces mackay discussion introduced nnbm recurrent neural network model describe multimodal nonnegative data application made practical efficiency slice sampling monte carlo method learning algorithm incorporates numerical sampling nnbm distribution learn observations nonnegative data demonstrated application nnbm learning cooperative invariant distribution real data images human faces extensions present work include incorporating hidden units recurrent network addition hidden units 
implies modelling higher order statistics data requires calculating averages hidden units anticipate marginal distribution units commonly unimodal field theory valid approximating averages extension involves generalizing nnbm model continuous data confined range situation slice sampling techniques efficiently generate representative samples case hope work research types recurrent neural networks model complex multimodal data acknowledgements authors acknowledge discussion john hopfield sebastian seung indebted sompolinsky pointing maximum entropy interpretation boltzmann machine work funded bell laboratories technologies grateful support open ears references hinton sejnowski optimal perceptual learning ieee conference computer vision pattern recognition washington ackley hinton sejnowski learning algorithm boltzmann cognitive science seung rectified gaussian distribution advances neural information processing systems mackay introduction monte carlo methods learning graphical models kluwer academic press nato science series galland limitations deterministic boltzmann machine learning network kappen field approach learning boltzmann machines pattern recognition practice amsterdam neal suppressing random markov chain monte carlo ordered technical report dept statistics university toronto neal markov chain monte carlo methods based density function technical report dept statistics university toronto sompolinsky theory orientation tuning visual cortex proc acad seung learning parts objects nonnegative matrix factorization nature
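The slice sampling idea from the nnbm paper (draw a height under the density at the current point, then pick a new point from the slice, with the slice boundary along a direction found by solving a quadratic) can be sketched for the rectified gaussian energy E(x) = 0.5 x^T A x - b^T x on x >= 0. This is a simplified variant and not the reflective ballistic dynamics the paper uses: it swaps in a hit-and-run style move that samples uniformly on the slice segment along one random direction. It assumes A is symmetric and copositive so the segment is always bounded. The matrix `A`, bias `b`, and the function names are hypothetical values chosen to give a competitive two-mode distribution.

```python
import math
import random


def energy(x, A, b):
    """Rectified-gaussian energy E(x) = 0.5 x^T A x - b^T x (A assumed symmetric)."""
    n = len(x)
    quad = sum(x[i] * A[i][j] * x[j] for i in range(n) for j in range(n))
    return 0.5 * quad - sum(b[i] * x[i] for i in range(n))


def slice_step(x, A, b, rng):
    """One slice-sampling move along a random direction, restricted to x >= 0."""
    n = len(x)
    c = math.log(1.0 - rng.random())  # log of a uniform height in (0, 1], so c <= 0
    d = [rng.gauss(0.0, 1.0) for _ in range(n)]
    norm = math.sqrt(sum(v * v for v in d))
    d = [v / norm for v in d]
    # Along the line, E(x + t d) - E(x) = 0.5 a t^2 + g t; the slice is where this <= -c.
    a = sum(d[i] * A[i][j] * d[j] for i in range(n) for j in range(n))
    g = sum(d[i] * (sum(A[i][j] * x[j] for j in range(n)) - b[i]) for i in range(n))
    lo, hi = -math.inf, math.inf
    for xi, di in zip(x, d):  # nonnegativity constraints x_i + t d_i >= 0
        if di > 0:
            lo = max(lo, -xi / di)
        elif di < 0:
            hi = min(hi, -xi / di)
    disc = g * g - 2.0 * a * c
    if abs(a) < 1e-12:  # effectively linear along this direction
        if g > 0:
            hi = min(hi, -c / g)
        elif g < 0:
            lo = max(lo, -c / g)
    elif a > 0:  # roots of the quadratic bracket t = 0
        r = math.sqrt(disc)
        lo, hi = max(lo, (-g - r) / a), min(hi, (-g + r) / a)
    elif disc > 0:  # a < 0: keep the connected piece of the slice that contains t = 0
        r = math.sqrt(disc)
        t1, t2 = sorted(((-g - r) / a, (-g + r) / a))
        if t1 >= 0:
            hi = min(hi, t1)
        else:
            lo = max(lo, t2)
    if not (math.isfinite(lo) and math.isfinite(hi)):
        raise ValueError("unbounded slice: A does not look copositive")
    t = rng.uniform(lo, hi)
    return [max(0.0, xi + t * di) for xi, di in zip(x, d)]


if __name__ == "__main__":
    rng = random.Random(0)
    A = [[1.0, 2.0], [2.0, 1.0]]  # hypothetical competitive weights
    b = [1.0, 1.0]
    x = [1.0, 1.0]
    for _ in range(1000):
        x = slice_step(x, A, b, rng)
    print(x, energy(x, A, b))
```

Samples produced this way can be averaged to estimate the free statistics that the learning rule contrasts with the clamped statistics of the data.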
9 provably kowalczyk research laboratories road australia abstract results study worst case learning curves particular class probability distribution input space hard threshold hidden units presented shown particular thermodynamic limit scaling number connections hidden layer true learning curve behaves vcdimension based bound trivial bound trivial shown bounds true learning curve derived formalism based density error patterns introduction extensions link generalisation capabilities binary neural network counting function upper bounds implied vcdimension function linear perceptrons counting function constant selection fixed number input samples essentially equal upper bound determined vcdimension lemma case multilayer perceptrons counting function depends essentially selected input samples instance shown recently sigmoidal units largest number input samples shattered vcdimension equals nonzero probability finding input sample shattered number weights network case heaviside sigmoidal activations mcculloch pitts neurons similar claim made vcdimension partition function computational learning theory provably generalize number weights hidden layer units nonzero probability finding sample size shattered results hard samples types differ significantly terms techniques derivation sigmoidal case result based recent advances model theory heaviside case proofs constructive defining class probability distributions hard samples drawn randomly results case explicit form counting function existence hard samples essential generalisation capabilities essential factor improvement theoretical models generalisation paper show mcculloch pitts case specific continuous probability distributions input space answer estimate directly real learning curve case show bounds based vcdimension learning sample regimes training samples examples linear perceptron show modification significantly bound part rigorous formal extension results results presented thermodynamic limit training sample size 
increasing proportionally simplifies mathematical form overview formalism sample space class binary functions call hypothesis space assume probability target concept called learning system usual hypothesis associate generalization error training error training learning threshold introduce auxiliary random variable giving worst general ization error hypotheses training error basic objects interest paper learning curve defined thermodynamic limit introduce thermodynamic limit learning curve idea asymptotic analysis capture essential features learning paper denotes maximal element closure element exists similarly understand learning curve determined worst generalisation error accept hypotheses respect differs average generalisation error learning curves considered kowalczyk systems large size mathematically turns thermodynamic limit functional forms learning curves simplify significantly analytic char sequence learning systems shortly scaling property scaling thought measure size complexity learning system vcdimension thermodynamic limit scaled learning curves defined additional subscript refers learning system error pattern density formalism subsection briefly presents thermodynamic version modified formal discussed previously details proofs found main innovation approach splitting error patterns error estimates size error total number error patterns examples discussed section improves results significantly space binary naturally splits error pattern shell composed vectors entries equal denote vector error pattern position error shell elements average error pattern density falling error shell denotes cardinality theorem sequence learning systems function recall denotes largest integer defined monotonic sequence note contrast ordinary limit exists difference concept error partitions finite hypothesis space generalisation error values related central result theorem derived theorem provably generalize denotes entropy function main results applications formalism learning 
sequence case scaling sequence vcdimension assume bounds learning system derived consistent learning case rain thermodynamic limit note bound independent probability distributions piecewise constant functions denote class piecewise constant binary functions unit segment discontinuities values defined discontinuity points learning sequence continuous probability distributions monotonic sequence positive integers targets limit exists loss generality assume uniform distribution learning sequence established claim function assumption respect claim sided bound learning curve holds kowalczyk outline main steps proof claims claim start combinatorial argument establishing particular case constant target observe equals easily claim constant target observe case upper bound general case target effective number discontinuities claim start estimate derived result constant target const implies immediately expression constant target extends estimate straightforward lower upper bound effective number discontinuities case target link multilayer perceptron denote class function implemented multilayer perceptron feedforward neural network number hidden layers connections hidden layer hidden layer composed fully connected linear threshold logic units units implement mapping form shown properties determinant mapping coordinates composed linearly independent polynomials generic situation degree implies immediately results learning class functions section applicable obvious modifications class multilayer perceptrons probability distribution concentrated curves form step extend distribution continuous distribution support sufficiently close curve provably generalize scaled training sample size entropy figure plots estimates thermodynamic limit learning curves sequence multilayer perceptrons claim consistent learning estimates true learning curve upper lower bound upper bounds form modified claim plotted marked comparison plot bound based bound trivial scaling corollary shown error pattern 
densities learning curves small desired observation implies result claim sequence multilayer perceptrons exists sequence continuous probability distributions properties sequence targets claim claim section hold learn sequence scaling bound learning curve holds claim corollary additionally number units hidden layer thermodynamic limit respect scaling trivial proof bound trivial continuous probability input space bound trivial possibility dimension based bounds applicable fail capture true behavior independence distribution tion situation estimate expectation logarithm counting function number dichotomies perceptron input points case lower bound general position virtually lemma vcdimension replaced kowalczyk entropy dimension lemma hope result bounds form replacing vcdimension resulting bound thermodynamic limit respect scaling note entropy based bounds obtained prior distribution hypothesis space account plots learning curves shown figure acknowledgement permission research laboratories paper gratefully acknowledged references blumer ehrenfeucht haussler warmuth learnability vapnikchervonenkis dimensions journal cover geometrical statistical properties linear inequalities applications pattern recognition ieee trans elec comp kearns bounds sample complexity bayesian learning information theory dimension machine learning haussler kearns seung tishby rigorous learning curve bounds statistical mechanics proc pages niranjan practical applicability dimension bounds neural computation koiran sontag neural networks quadratic vcdimension proc nips pages press cambridge kowalczyk counting function theorem multilayer networks proc nips pages morgan kaufmann publishers kowalczyk estimates storage capacity multilayer perceptron threshold logic hidden units neural networks kowalczyk generalisation feedforward networks proc nips pages press cambridge kowalczyk asymptotic version generalisation learning systems preprint kowalczyk williamson learning curves modified case study proc ieee 
kowalczyk bartlett williamson examples learn curves modified proc nips pages press neural nets neural computation random division interval proc cambridge phil sakurai tighter bounds vcdimension threelayer networks proc world congress neural networks sontag sets points general position requires parameters report rutgers center systems control vapnik estimation dependences based empirical data springerverlag vapnik nature statistical learning theory springerverlag
5 remote sensing image analysis texture classification neural network greenspan rodney goodman department electrical engineering california institute technology pasadena abstract work apply texture classification network remote sensing analysis goal extract characteristics area depicted input image achieving segmented region recently proposed combined neural network rulebased framework texture recognition framework unsupervised supervised learning probability estimates output classes describe texture classification network extend demonstrate application image analysis domain introduction work apply texture classification network remote sensing image analysis goal segment input image homogeneous textured regions identify region library textures tree area area distinction classification remote sensing imagery importance applications navigation exploration complex task spanning growing number sensors application domains applications include identification systems spot analysis mapping sensor exploration type classification input attention spectral signature greenspan goodman tion region types recently idea adding spatial information presented work investigate possibility information analysis recently developed texture recognition system greenspan achieves stateoftheart results natural textures paper apply system remote sensing imagery check systems robustness noisy environment texture play major role segmenting images areas enhancing sensors capabilities analysis indicating areas interest analysis pursued fusion spatial information spectral signature enhance classification automated analysis capabilities work literature focuses human rules specific sensor data calibration existing problems classic approach experienced required spend considerable amount time generating rules rules updated regions spatial rules exist complex imagery interesting question rule generation paper present learning framework spatial rules learned system database examples learning framework 
contribution system topic section experimental results systems application remote sensing imagery presented section network previously presented texture classification network combines neural network rulebased framework greenspan enables unsupervised supervised learning system consists major stages shown stage performs feature extraction transforms image space array feature vectors vector correspond local window original image evidence animal visual systems supporting multichannel orientation selective bandpass filters phase open issue decision appro number frequencies orientations required representation input domain define initial filters achieve computationally efficient filtering scheme multiresolution pyramidal approach learning mechanism shown derives minimal subset filters conveys sufficient information visual input differentiation labeling unsupervised stage clustering algorithm continuous input features supervised learning stage labeling input domain achieved rulebased network information theoretic measure utilized find informative correlations attributes pattern class specification providing probability estimates output classes ultimately minimal representation library patterns learned training mode classification figure system block diagram patterns achieved system detail initial stage classification system feature extraction phase task biological computational evidence support filters work gabor pyramid gabor wavelet decomposition define initial finite filters computational efficient scheme involves pyramidal representation image convolved fixed spatial support oriented gabor filters greenspan scales orientations scale degrees component produce feature vector output feature extraction stage pyramid representation computationally efficient image subsampled filtering process 
size reduction stages place scale pyramid feature values generated correspond average power response specific orientation frequency ranges window input image window mapped attribute vector output feature extraction stage goal learning system feature representation discriminate input patterns textures unsupervised supervised learning stages utilized minimal features extracted from attribute vector convey sufficient information visual input differentiation labeling unsupervised learning stage viewed preprocessing stage compact representation filtered input goal continuous valued features result initial filtering shifting symbolic representation input clustering stage found experimentally importance initial learning phase classification system discretization evident learn associations attributes symbolic representation rules output filtering stage consists continuous valued feature maps representing filtered version original input local area input image represented ndimensional feature vector array ndimensional vectors viewed input image input learning stage detect characteristic behavior dimensional feature space family textures learned work dimension attribute vector individually clustered training samples projected axis space onedimensional clusters found kmeans clustering algorithm duda hart statistical clustering technique consists iterative procedure finding means training sample space input sample closest euclidean distance means labeled minus arbitrarily correspond discrete codewords continuousvalued input sample mapped discrete codeword representing output preprocessing stage quantized vector attributes result codewords individual dimensions final supervised stage utilize existing information feature maps higher level analysis input labeling classification rule based information theoretic approach extension order bayesian classifier ability output probability estimates classes goodman classifier defines correlations input features output 
classes probabilistic rules data driven supervised learning approach utilizes information theoretic measure learn informative links rules features class labels classifier links provide estimate probability output class true presented input evidence vector rules considered fire classifier estimates posterior probability class rules fire form largest estimate chosen initial class label decision probability estimates output classes feedback purposes higher level processing rulebased classification system mapped layer feed forward architecture shown greenspan input layer node attribute hidden layer node rule output layer node class rule layer node connected class multiplicative weight evidence inputs rules class figure rulebased network results system achieved stateoftheart results structured unstructured natural texture classification work present initial results applying network noisy environment satellite imagery presents examples image pasadena california system imaging spec system covers contiguous spectral bands ously pixel resolution presented average bands visual range input image major distinguishing characteristic area surround categories learn training consists image sample category test input image noisy resolution difficult segment categories visual perception presented output area labeled white gray unknown areas darker gray rough segmentation desired regions achieved probabilistic networks output identification unknown unspecified regions elaborate analysis pursued greenspan dark gray areas correspond regions ample hill contact bottom hill slopes form mixture classes note initial results presented perceived result analysis resolution chosen system additional spectral bands input enable pixel resolution enable detecting additional classes visually concrete material variety higher resolution image presented bottom classes learned output label dark gray ground output label gray structured 
area field present structures white training image examples class input image result presented classes found rough segmentation regions achieved note detection areas main structured areas image including field white final relates autonomous navigation scenario autonomous require automated scene analysis system avoid obstacles rough fusion visual modalities segmentation texture stereo color domain inputs spectral decomposition analysis required challenging task present preliminary results scenes autonomous vehicle propulsion laboratory pasadena presented scenes left segmented regions training consists image samples category pixel image light gray black represents regions intensity suffice task corner system learned characteristics guided figure remote sensing image analysis results input test image shown left system output classification input white regions gray area dark gray reflects region types output bottom dark gray area light gray ground cover region white structures robustness noise generalization demonstrated challenging realworld problems segmentation regions note prob identifying region center bottom regions learn regions category specifically include examples training input image bottom light gray dark gray represents region black represents unknown category region labeled correctly unknown category note intensity confused region texture classification neuralnetwork achieving correct rough segmentation scene based characteristics encouraging results indicating learning system learned informative characteristics domain image analysis autonomous navigation summary discussion presented results demonstrate networks capability generalization robustness noise challenging realworld problems presented frame work learning mechanism rule generation framework current difficulties human experts knowledge automation rule generation enhance experts knowledge task hand 
demonstrated spatial information segment complex homogeneous regions systems strengths include generalization scenes invariance intensity ability feature vector representation include additional inputs additional spectral bands learn rules characterizing inte modalities future work includes modalities learn framework enhanced performance testing performance large database acknowledgement work supported part bell part darpa grant greenspan supported part intel fellowship research paper carried part propulsion laboratories california institute technology anderson pyramid software support autonomous vehicle images references approach discrimination ieee transactions remote sensing spectral texture pattern matching classifier digital imagery ieee transactions remote sensing jain knowledgebased segmentation images transactions remote sensing greenspan goodman combined neural network rulebased framework probabilistic pattern recognition discovery moody hanson lippman advances neural information processing systems mateo morgan kaufmann publishers greenspan goodman anderson learning texture discrimination rules multiresolution system submitted ieee transactions pattern analysis machine intelligence duda hart pattern classification scene analysis john wiley sons goodman miller smyth rulebased networks classification probability estimation neural computation
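The unsupervised stage described in this paper clusters each dimension of the attribute vector separately with k-means and maps every continuous value to the codeword of its nearest mean. A minimal sketch of that per-dimension quantization follows. The function names, the quantile-based initialisation and the fixed iteration cap are assumptions made for illustration, not details taken from the paper.

```python
import numpy as np


def kmeans_1d(values, k, iters=50):
    """Lloyd iterations on scalar data; returns the k cluster means, sorted."""
    values = np.asarray(values, dtype=float)
    # Assumption of this sketch: start the means at evenly spaced quantiles.
    means = np.quantile(values, (np.arange(k) + 0.5) / k)
    for _ in range(iters):
        # Assign every sample to its closest mean (Euclidean distance in 1-D).
        labels = np.argmin(np.abs(values[:, None] - means[None, :]), axis=1)
        new_means = means.copy()
        for j in range(k):
            members = values[labels == j]
            if members.size:  # keep the old mean if a cluster is empty
                new_means[j] = members.mean()
        if np.allclose(new_means, means):
            break
        means = new_means
    return np.sort(means)


def fit_codebooks(features, k):
    """Cluster each attribute dimension independently."""
    features = np.asarray(features, dtype=float)
    return [kmeans_1d(features[:, d], k) for d in range(features.shape[1])]


def quantize(features, codebooks):
    """Replace each continuous attribute by the index of its nearest mean."""
    features = np.asarray(features, dtype=float)
    codes = np.empty(features.shape, dtype=int)
    for d, means in enumerate(codebooks):
        codes[:, d] = np.argmin(
            np.abs(features[:, d, None] - means[None, :]), axis=1
        )
    return codes
```

The resulting integer codes are the kind of discrete attribute vector that a rule-based classifier can consume; the number of codewords per dimension (`k`) is a free parameter here.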
9 constructive network writer adaptation john platt jose abstract paper discusses fairly general adaptation algorithm standard neural network increase recognition specific user basis algorithm output neural network characteristic input output incorrect exploit characteristic output output adaptation module maps correct confidence vector simplified resource allocating network constructs basis functions online applied construct character recognition system online hand printed characters decreases word error rate test average creating basis functions writer test introduction major difficulties creating statistical pattern recognition system statistics training statistics actual creation statistical pattern recognizer considered regression problem class probabilities estimated fixed training statistical pattern recognizers tend work typical data similar training data work data represented training poor performance data problem human people tend provide drastically data figure solution difficulty create adaptive recognizer treat recognition static regression problem recognizer adapt statistics applied online handwriting recognition adaptive platt typical consistent incorrect neural network response correct neural network response figure input data neural network produces consistent incorrect output pattern recognizes consistent pattern produces corrected output recognizer improves accuracy user adapting recognizer user paper proposes method creating adaptive recognizer call output adaptation module inspired development neural network handwriting recognizer noticed output neural network characteristic input specific style character shown network networks output consistent specific style output incorrect exploit consistency incorrect outputs decided network learns recognize consistent incorrect output vectors produces correct output vector figure units radial basis functions adaptation units performed simplified version resource allocating network algorithm platt number 
units scales number presented learning examples contrast algorithms allocate unit learned properties recog adaptation fast user provide additional examples data recognition speed degradation modest amount additional memory user required limited neural network recognizers output corrected vector contextual postprocessing single label relationship previous work related previous work user adaptation neural recognizers speech handwriting previous user adaptation neural handwriting recognizer employed time delay neural network tdnn layer tdnn replaced tunable classifier adaptation guyon layer tdnn replaced knearest neighbor classifier work extended layer tdnn replaced optimal hyperplane classifier retrained adaptation purposes optimal hyperplane classifier retained accuracy knearest neighbor classifier reducing amount computation memory required adaptation present work improves previous handwriting systems ways require retraining storage entire layer network reduces memory requirements produces output vector simply output label vector effectively contextual postprocessing step label adaptation experiments performed neural network recognizes full character previous papers experimented neural networks recognized character subsets difficult adaptation problem related stacking stacking outputs multiple recognizers combined training partitions training multiple outputs recognizer combined memorybased learning trained statistics actual predefined training partition output adaptation module section paper describes detail section describes application create handwriting recognizer maps output neural network output adding adaptation vector depending neural network training algorithm output neural network output estimate posteriori class probabilities suitable postprocessing goal bring output closer ideal response experiments target neuron correct character neurons adaptation vector computed radial basis function network takes input center 
radial basis function distance metric parameter controls width basis function figure architecture decreasing function controls shape basis functions amount correction basis function adds output call memories adaptation module call correction vectors figure function decreasing polynomial function distance function euclidean distance metric input vectors range order reduce spurious noise algorithm constructing radial basis functions simplification algorithm starts memories corrections user recognition error finds distance nearest memory vector distance greater threshold unit allocated memory vector correction vector correct error step size distance unit allocated correction vector nearest memory updated correct error step size experiments values chosen learning speed gain learning stability number radial basis functions grows number errors units allocated errors errors similar algorithm updates correction vectors simplified rule computation nearest memory additional memory target corresponds highest output memory considered order prevent allocating memories neural network output unambiguous memory prevents affecting output written characters adaptation algorithm character shown network user error target vector correct character target vector highest element allocate memory index memory memories exist mini mini experiments results test effectiveness create handwriting recognition system connected outputs writer independent neural network trained recognize characters handprinted boxes neural network carefully tuned multilayer feedforward network trained backpropagation algorithm network inputs hidden units outputs input vector class range input upper case character lower case character digit input member subset characters tested tests performed writers disjoint training writers neural network writers writing styles difficult network recognize test characters entered writers instructed write examples 
characters reflected writing style words writers word list consist combination characters users shown results combined words processed dictionary system failed recognize word correctly misclassified characters desired labels adapt system figure cumulative number word errors writer writer adaptation table quantitative test results user figure shows performance writer writer total number word errors adaptation started plotted number words shown baseline cumulative error slope curve estimate instantaneous error rate slope writers decrease dramatically adaptation progresses test word error rate writer word error rate writer examples show substantially improve accuracy neural network quantitative results shown table word error rates obtained compared baseline word error rates columns number stored basis functions number words tested average errors test accuracy rates entire test count errors made adaptation taking place test true error rates writers lower shown table figure experiments showed adapts quickly requires small amount additional memory computation writers presentations variant character sufficient adapt maximum number stored basis functions experiments substantially affect recognition speed system conclusions designed widely applicable output adaptation module place standard neural networks takes output network input determines additional adaptation vector output adaptation vector computed radial basis function network learned simplification algorithm nice properties examples needed learn inputs number stored memories grows number errors recognition rate neural network unaffected adaptation output module confidence vector suitable postprocessing addresses difficult problem creating adaptive applied create handwriting recognition system test difficult writers adaptation module 
decreased error rate stored basis functions writer acknowledgements steve nowlan helpful suggestions development algorithm work neural network work neural network references guyon henderson albrecht denker writer independent writer adaptive neural network online character recognition editor pixels features amsterdam elsevier kadirkamanathan niranjan function estimation approach sequential learning neural networks neural computation guyon denker vapnik writer adaptation online handwritten character recognition tokyo ieee computer society press platt network function interpolation neural computation radial basis functions multivariate interpolation review mason editors algorithms approximation oxford press wolpert stacked generalization neural networks
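The output adaptation module in this paper adds a correction to the network's output vector, computed by radial basis functions centred on stored "memories". A unit is allocated when a user-flagged error lies farther than a threshold from every memory, and otherwise the nearest unit's correction vector is updated. The sketch below illustrates that allocate-or-update rule. The basis-function shape, the parameter names and their default values are assumptions made for illustration, since the specific formulas are not recoverable from the text; the error is taken here relative to the adapted output, which is also an assumption.

```python
import numpy as np


class OutputAdaptation:
    """Sketch of an allocate-or-update RBF corrector on network outputs."""

    def __init__(self, radius=1.0, threshold=0.5, step_new=0.5, step_update=0.25):
        self.radius = radius            # width of each basis function (assumed)
        self.threshold = threshold      # allocation distance threshold (assumed)
        self.step_new = step_new        # step size when a unit is allocated
        self.step_update = step_update  # step size when a unit is updated
        self.memories = []              # stored network outputs (RBF centres)
        self.corrections = []           # correction vector attached to each memory

    def _basis(self, dist):
        # Assumed shape: a decreasing polynomial of distance, zero beyond radius.
        return max(0.0, 1.0 - (dist / self.radius) ** 2) ** 2

    def adapt_output(self, net_out):
        """Network output plus the summed, distance-weighted corrections."""
        net_out = np.asarray(net_out, dtype=float)
        out = net_out.copy()
        for memory, correction in zip(self.memories, self.corrections):
            out += correction * self._basis(np.linalg.norm(net_out - memory))
        return out

    def learn(self, net_out, target):
        """Call when the user corrects a recognition error."""
        net_out = np.asarray(net_out, dtype=float)
        error = np.asarray(target, dtype=float) - self.adapt_output(net_out)
        if self.memories:
            dists = [np.linalg.norm(net_out - m) for m in self.memories]
            nearest = int(np.argmin(dists))
            if dists[nearest] <= self.threshold:
                # Close to an existing memory: adjust its correction vector.
                self.corrections[nearest] += self.step_update * error
                return
        # Far from every memory (or none stored yet): allocate a new unit.
        self.memories.append(net_out.copy())
        self.corrections.append(self.step_new * error)
```

Because units are only created on errors, the number of stored memories in this sketch grows with the number of corrected mistakes rather than with the number of inputs seen.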
10 properties natural images german nadal paris cedex france abstract scale invariance fundamental property ensembles images gaussian properties understood existence rich statistical structure work present detailed study marginal statistics variable related edges images numerical analysis shows exhibits extended scaling property stronger moments expressed power moment interesting exponents predicted terms multiplicative process model recently predict correct exponents structure functions flows results study underlying singularities find singular structures onedimensional singular manifold consists sharp edges category visual processing introduction important motivation studying statistics natural images relevance modeling visual system development lead adaptation visual processing statistical regularities visual scenes predictions development receptive fields obtained gaussian description environment contrast statistics gaussian properties found important gain insight gaussian aspects natural scenes investigate similarity properties edge type variable scale invariance natural images property appears power behaviour power spectrum contrast parameter depends images included dataset detailed analysis scaling properties contrast authors noted analogy statistics natural images flows model explain scaling behaviour observed hand large amount effort understand statistics flows develop predictable models qualitative quantitative theories fully developed turbulence elaborate original argument cascade energy scale terms local energy dissipation unit mass linear size quantity component velocity point variable similarity properties range scales called range denotes moment energy dissipation marginal distribution general scaling relation called extended found valid larger scale domain relation exponent moment respect moment notice holds refer moments local edge variance images basic field contrast 
define difference average analogy definition variable variation contrast choose study variables defined position scale variable takes contributions edges horizontal segment size vertical variable defined similarly direction refer derivative direction edge direction justified sense presence borders derivative great evaluated inside surface sharp edges maxima derivative definition local linear edge variance direction scale remark edges important characterizing images recent numerical analysis suggests natural images composed statistically independent edges analyzed scaling properties local linear edge variances images forest pixels images provided ruderman technical details analysis image resolution finite size effects existence upper lower approximately show holds range scales exponents illustrated logarithm moments horizontal vertical local edge variances plotted function holds range holds considered range representative graphs shown linear dependence observed horizontal vertical directions similar found turbulence property obtain accurate estimation exponents structure functions references exponents estimated squares regression shown function error bars refer statistical dispersion figs sees horizontal vertical directions similar statistical properties exponents differ fig surprisingly holds directions exponents multiplicative processes scaling models predict exponents holds exponents obtained measuring simplest scaling hypothesis random variable observed scale probability distribution obtained scale derives easily holds flows corresponds prediction exponents shows naive scaling violated discrepancy dramatic expressed terms normalized variable taking shown maximum fact finite variable defined distribution scaling relation identity hold generalize scaling hypothesis longer constant stochastic variable scaling relation introduced context flows integral representation general necessarily linear exponents kernel chosen predicted terms 
multiplicative processes factor stochastic variable determined kernel scale arbitrary scale reached scale kernel obey composition obtained cascade infinitesimal processes specific choices define models model corresponds simple process probability probability stochastic process yields distribution exponents expressed terms test models exponents obtained image data resulting model shown vertical horizontal exponents fitted integral representation directly tested probability distributions evaluated data show prediction obtained compared actual predict exponents obtain exponents parameter chosen asymptotic exponent prefer definition square determine obtaining horizontal variable vertical analysis partition image sets pixels singularity exponent local edge variance defines dimensions legendre transform dimension images interested singular manifolds call dimension singularity exponent maximum variable singular manifold points model data obtain result singular structures dimensional reflects fact singular manifold consists sharp edges conclusions main result work existence trivial scaling properties local edge variances property appears similar observed turbulence local energy dissipation fact model predicts relevant exponents describes scaling behaviour edges image ensemble interesting simple generative model images nadal correct power spectrum reproduce selfsimilar properties found work acknowledgements grateful ruderman giving image data base stimulating discussions discussion link scaling exponents dimension singular structure fruitful discussions acknowledge collaboration early stages work work partly supported french program grant references field lett phys physica phys lett france barlow sensory communication press cambridge comp physiology atick network olshausen field nature cognitive science press nadal phys lett ruderman bialek phys lett ruderman network turbulence cambridge univ press bell sejnowski vision research phys lett phys physica phys lett ruderman 
vision research properties natural images figure test plot pixels horizontal direction vertical direction figure test plot pixels horizontal direction vertical direction nadal figure exponents vertical horizontal variables direction vertical direction solid line represents model obtained figure verification validity integral representation kernel horizontal local edge variance largest scale starting histogram denoted crosses distribution parameter kernel prediction distribution scale squares compared direct evaluation similar results hold pairs scales shown figure test vertical case good horizontal variable
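The analysis in this paper computes moments of a local linear edge variance at several scales and estimates the relative exponents by a least-squares regression of the logarithm of one moment against the logarithm of a reference moment. A minimal sketch of that procedure follows. The exact definition of the edge variable is only partly recoverable from the text, so the version here (mean squared finite difference of the contrast along horizontal, non-overlapping windows of `r` pixels) is an assumption, as are all function names.

```python
import numpy as np


def local_edge_variance(contrast, r):
    """Mean squared horizontal finite difference over windows of r pixels.

    Assumption of this sketch: the derivative is taken along the window
    direction and windows do not overlap.
    """
    grad_sq = np.diff(np.asarray(contrast, dtype=float), axis=1) ** 2
    usable = (grad_sq.shape[1] // r) * r
    windows = grad_sq[:, :usable].reshape(grad_sq.shape[0], -1, r)
    return windows.mean(axis=2)


def moments(contrast, scales, p):
    """p-th moment of the edge variable at each scale in `scales`."""
    return np.array(
        [np.mean(local_edge_variance(contrast, r) ** p) for r in scales]
    )


def relative_exponent(moment_p, moment_ref):
    """Slope of log(moment_p) against log(moment_ref), by least squares."""
    slope, _intercept = np.polyfit(np.log(moment_ref), np.log(moment_p), 1)
    return slope
```

In use, one would call `moments` for the order of interest and for the reference order over the same list of scales, then pass both arrays to `relative_exponent`; a straight line in the log-log plot is what the extended scaling property predicts.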
1 information theoretic approach rulebased connectionist expert systems rodney goodman john miller department electrical engineering caltech pasadena smyth communication systems research propulsion laboratories grove drive pasadena abstract discuss paper architectures executing probabilistic manner theoretical basis recently introduced informationtheoretic models begin describing learning algorithm theory quantitative rule modelling discussion exact nature models finally work approach database rules inference network compare networks performance theoretical limits specific problems introduction cheap mass storage devices common domains maintain large databases data telecommunications medicine question naturally arises extract models data automated manner models basis autonomous rational agent domain automatically generate systems data aspects problem learning model performing inference model propose paper hybrid approach learning inference combine qualitative knowledge representation ideas distributed computational advantages connectionist models underlying theoretical basis tied information theory knowledge representation formalism adopt rulebased representation scheme supported cognitive researchers modeling higher level symbolic reasoning tasks recently developed informationtheoretic algorithm called optimal probabilistic rules data form neural learning backpropagation approach simply learning algorithm computationally direct understood backpropagation learning task finding infor individual rules reference collective properties performing inference model rules difficult problem exact theoretical schemes maximum entropy intractable realtime applications investigating schemes rules represent links directed graph nodes correspond propositions pairs approach loosely connected multiple path arbitrary topology graph structures nodes performing local nonlinear decisions true state based supporting evidence priori bias fact 
recurrent neural network approach compared standard connectionist model learned algorithm difference lies semantics representation weights ratios based transformations probabilities possess clear user nodes explicit representation knowledge requirement system perform reasoning probabilistic conversely lack explicit knowledge representation current connectionist approaches black syndrome limitation application critical domains explanation criteria field learning model observations samples number items database sample datum terms attributes features assume values discrete alpha data form binary vectors requirement discrete continuousvalued attributes dictated nature rulebased representation addition important note assume sample data exhaustive tendency neural network learning literature analyse learning terms learning boolean function truth table implicit assumption made samples good learning algorithm learn function depends feature representation problem interest hidden consequent nonzero bayes misclassification risk function dependent features unseen columns truth table artificial problems game playing perfect classification practical problems nature real features phenomenon statistical pattern recognition literature renders schemes simply perfectly classify training data simple model rule probability attributes random variables values respective discrete sample data earlier pose problem find data rules refer problem rule induction order distinguish special case deriving classification rules require preference rank rules learning algorithm preference measure find rules define information event yields variable based requirements nonnegative expectation respect equals average mutual information function defined recently shown possesses unique properties rule information measure general jmeasure average change bits required priori distribution posteriori distribution interpreted special crossentropy binary discrimination kullback distributions
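A minimal sketch of the instantaneous rule information described above, assuming it takes the form of the binary cross-entropy (Kullback discrimination) between the posterior and prior probability of the rule's right-hand side, measured in bits. The function name `instantaneous_j` and its arguments are illustrative and do not come from the original paper.

```python
import math

def instantaneous_j(prior, posterior):
    """Binary cross-entropy, in bits, from prior p(x) to posterior p(x|y).

    Assumed form: p(x|y) log2(p(x|y)/p(x)) + (1-p(x|y)) log2((1-p(x|y))/(1-p(x))).
    The prior must lie strictly between 0 and 1.
    """
    if not 0.0 < prior < 1.0:
        raise ValueError("prior must be strictly between 0 and 1")
    total = 0.0
    # Sum over the two outcomes (x true, x false); a term with zero
    # posterior mass contributes nothing (0 * log 0 is taken as 0).
    for post, pri in ((posterior, prior), (1.0 - posterior, 1.0 - prior)):
        if post > 0.0:
            total += post * math.log2(post / pri)
    return total
```

Under this assumption a rule that leaves the probability unchanged carries zero information, and a rule that makes a 50/50 proposition certain carries exactly one bit.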
define average information content simply weights instantaneous rule information probability lefthand side occur rule fired definition motivated considerations learning rules environment rule high information content good predictor probability fired small interestingly definition possesses welldefined interpretation terms classical induction theory trading hypothesis simplicity hypothesis data algorithm jmeasure derive informative rules input data algorithm produces probabilistic rules ranked order decreasing information content parameter determined statistical significance test based size sample data algorithm searches space rules generality rules information theoretic bounds constrain search space model perform inference learned model lower order constraints order joint distribution form probabilistic rules priori model typical inference situation initial conditions nodes clamped allowed measure state nodes possibly cost infer state probability goal propositions nodes evidence important note difficult general problem classification single fixed goal variable initial conditions goal propositions vary considerably problem instance inference problem determining posteriori distribution face incomplete uncertain information exact maximum entropy solution problem tractable problem formulation stochastic relaxation techniques geman present impractical realtime robust applications motivation perform approximation exact inference robust manner mind developed models describe hypothesis testing network uncertainty network principles hypothesis testing network model consideration directed link assigned weight weight evidence idea necessarily interpretation approach previous work node assigned threshold term priori bias sigmoidal activation function based multiple binary inputs true conditionally independent write updating rule conditionally independent terms hypothesis test chosen true describes decision region independent measurements
evidence model interpreted distributed form hypothesis testing miller smyth principles uncertainty network model defined weight directed link threshold model interpret change bits posteriori positive support negative support activation multiple input links weighted activation functions interpreted total directional change bits required calculated locally node obtain average change bits dividing suitable temperature node make local decision recovering inverse jmeasure transformation sigmoid approximation inverse function experimental results conclusions section show rules generated data automatically incorporated parallel inference network takes form multilayer neural network network perform parallel inference domain financial database mutual published statistical data approach typical real world domains figure shows portion typical data mutual line instance fund omitted column represents attribute feature fund attributes numerical categorical typical categorical attributes fund type reflect investment fund growth growth balanced growth typical numerical attribute year return investment expressed percentage total fund examples data data quantized examples produced serve input figure attributes binary values directly implemented binary neurons software processes table produce rules rules ranked order decreasing information jmeasure figure shows portion rules output mutual fund data hypothesis test loglikelihood metric instantaneous jmeasure average shown rule transition probability order perform inference rules rules neural inference automatically generated network file loaded neural network simulator rule information metrics connection weights figure shows typical network derived rule output mutual data clarity connections shown architecture consists layers neurons input layer output layer activation range unit input layer unit output layer attribute mutual data output feeds back input layer layer synchronously updated output
units considered hand sides rules receive inputs rules strength connection rules metric output units implement sigmoid activation function inputs compute activation estimator hand side posteriori attribute input units simply pass output layer linear activation perform inference network probe vector attribute values loaded input output layers values clamped change unknown desired attribute values free change network feedback cycles converges solution read input output units evaluate models setup standard classification tests varying number nodes clamped inputs unclamped nodes priori probability relaxing network activation compared true attribute values sample order determine classification performance models trained randomly selected sets samples performance results table average classification rate models unseen samples bayes risk uniform loss matrix classification test calculated samples actual performance networks occasionally exceeded small sample variations cross table units clamped uncertainty test hypothesis test bayes risk conclude performance networks classifiers learned model data rulebased representation network performs slightly uncertainty model close estimated optimal rate bayes risk independence assumptions models hold coin term robust inference describe kind accurate behaviour presence incomplete uncertain information based encouraging initial results current research focusing higherorder rule networks extending theoretical understanding models nature acknowledgments work supported part grant bell program advanced technologies sponsored general general motors research paper carried propulsion laboratory california institute technology contract national aeronautics space administration john miller supported grant references goodman smyth information theoretic model rulebased expert systems presented international symposium information theory japan goodman smyth information theoretic rule induction proceedings european conference publishing
london goodman smyth deriving rules databases algorithm submitted publication pearl probabilistic semantics connectionist networks ieee amount information ieee transactions information theory smyth goodman information content probabilistic rule submitted publication kullback information theory statistics york wiley smith inductive inference theory methods geman stochastic relaxation methods image restoration expert systems maximum entropy bayesian methods science kluwer academic publishers hinton sejnowski optimal perceptual inference ieee principles practice information theory addisonwesley reading american association individual mutual international publishing corporation chicago figure mutual data attributes fund type year return beta bull bear perf stocks asset expense ratio figure quantized mutual data figure mutual rules figure rule network weight output layer sigmoid units
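The hypothesis testing network in this paper gives each node a prior bias, sums weighted evidence from inputs treated as conditionally independent, and applies a sigmoid. The sketch below illustrates that general idea only. It assumes each link weight is a log likelihood ratio (the usual naive Bayes reading of "weight of evidence"), which the text here does not confirm as the exact definition the authors used. The name `node_posterior` is hypothetical.

```python
import math

def node_posterior(prior, likelihood_ratios):
    """Posterior probability of a proposition from independent evidence.

    prior             : a priori probability of the proposition (0 < prior < 1)
    likelihood_ratios : for each active input, p(input | true) / p(input | false)
                        (assumed form of the link weight before taking logs)
    """
    # The bias term is the prior log-odds; each input adds its log likelihood ratio.
    log_odds = math.log(prior / (1.0 - prior))
    for ratio in likelihood_ratios:
        log_odds += math.log(ratio)
    # The sigmoid turns the summed log-odds back into a probability.
    return 1.0 / (1.0 + math.exp(-log_odds))
```

With no evidence the node returns its prior, and two pieces of evidence with reciprocal likelihood ratios cancel, which is the behaviour expected of a local hypothesis test under the independence assumption.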
8 adaptive retina centersurround receptive field boahen computation neural systems california institute technology pasadena abstract vertebrate invertebrate retinas highly contrast independent background intensity decades rendered adaptation operating point background intensity maintaining high gain transient responses center surround properties retina system extract information edges image silicon retina models adaptation properties receptors center surround properties laminar cells invertebrate retina layer vertebrate retina spatiotemporal responses silicon retina moving bars chip pixels fabricated technology introduction observed previously initial layers vertebrate retina systems perform similar processing functions incoming input response versus intensity curves receptors vertebrate retinas similar curves show receptors larger gain illumination steady illumination receptors adapt adaptation property receptor respond large input range saturating anatomically eyes invertebrates differ greatly vertebrates simple eyes insects compound eyes compound consists consists photoreceptors receptors called single spectral class receptors provide channels wavelength discrimination vertebrate divided layer layer layer consists cones horizontal cells bipolar cells invertebrate receptors response increase light contrast vertebrate receptors increase light intensity vertebrate invertebrate receptors show light background illumination property retina maintain high transient gain contrast wide range background intensities invertebrate receptors project layer called layer layer consists primarily cells show similar response intensity curve vertebrate bipolar cells layer cells respond graded potentials illumination cells show high transient gain illumination ignoring background intensity possess centersurround receptive fields cones excited incoming light activate horizontal cells turn inhibit cones horizontal cells mediate lateral inhibition
produces centersurround properties insects process lateral inhibition current flow photoreceptors cells surrounding modulation local field potential influence potential centersurround receptive fields contrasts surround computes local center signal previously silicon retina adaptive photoreceptors boahen recently compact currentmode analog model layer vertebrate retina analysed spatiotemporal processing properties recent array photoreceptors adaptive photoreceptor circuit adapts operating point background intensity pixel shows high transient gain background illumination retina spatial coupling pixels pixels silicon retina compact circuit incorporates spatial temporal filtering light background intensity network exhibits centersurround behavior boahen currentmode retina draw analogy parts circuit cells layer analogy drawn silicon retina invertebrate retina function cells completely understood output responses retina circuit similar output responses photoreceptor cells invertebrates circuit details section spatiotemporal processing performed retina stimulus moving speeds shown section boahen circuit figure onedimensional version retina equivalent circuit onedimensional version retina shown figure retina consists adaptive photoreceptor circuit pixel coupled controlled voltages output network obtained voltage output current output outputs properties obtained produces current proportional incident light logarithmic properties obtained operating feedback transistor shown figure subthreshold region voltage change output photoreceptor proportional small contrast oxide capacitance thermal voltage capacitance transistor circuit works increases output voltage increases amplifier gain output stage output change coupled capacitor ratio feedback transistor operates subthreshold region supplies current offset increase gate voltage current supplied increase node voltage back voltage level needed bias current transistor time figure figure
shows output response receptor variation intensity light incident chip response shows high sensitivity receptor maintained decades differing background intensities numbers section curve intensity absolute intensity adaptive element curve hyperbolic sine small slope curve middle means small voltages element large voltage current exponential charged figure shows output response photoreceptor variation intensity results show circuit small contrast decades background steadystate voltage photoreceptor output varies details photoreceptor circuit adaptation properties spatiotemporal response spatiotemporal response network moving stimuli explored section circuit shown figure transferred equivalent network resistors capacitors shown figure obtain transfer function circuit capacitors node model boahen time time figure response pixel grey strip pixels wide graylevel dark background level moving past pixel speeds response pixel dark strip graylevel white background level moving past pixel speeds voltage shown curves direct measurement voltage drives transistor current sensed offchip current temporal responses circuit chip results experiments illustrate centersurround properties network difference surround center chip results data chip shown figures experiments pixel array rotating circular stimulus alternating contrasts mounted chip stimulus created figure shows spatiotemporal impulse response pixel measured small strip level dark background level moving past pixels slow speeds impulse response shows centersurround behavior pixel receives inhibition preceding pixels excited stimulus stimulus moves pixel interest excited inhibited subsequent pixels stimulus time figure response pixel strip varying contrasts dark background moving past pixel constant speed faster speeds initial inhibition response grows smaller faster speed initial inhibition longer observed response inhibition surround constant center stimulus moves past pixel
interest inhibition preceding pixels excited stimulus time inhibit pixel interest excitation inhibition place stimulus passes note figures figures curves displaced show pixel response speeds moving stimulus voltage shown curves direct measurement voltage drives transistor current sensed offchip current figure shows spatiotemporal impulse response pixel similar boahen size strip level light background level moving past pixels inhibition behavior increasing stimulus speeds shows output response stimulus varying dark background level moving speed peak excitation response plotted contrast figure level corresponds level corresponds measurements mounted piece paper contrast measured varies exponentially increasing level conclusion paper adaptive retina centersurround receptive field system properties retina model functionally responses laminar cells invertebrate retina layer vertebrate retina show circuit shows adaptation decades background intensities centersurround property network spatiotemporal response stimulus speeds property serves remove redundancy space time input signal acknowledgements carver mead support encouragement supported fellowship boahen supported sloan fellowship delbruck inspiration testing design bradley minch comments fabrication provided mosis references coding efficiency design retinal processing vision springer berlin retinal resistance electrical lateral inhibition nature lond mahowald silicon retina adaptive photoreceptors symposium electronic science technology neurons chips april boahen andreou contrast sensitive silicon retina reciprocal synapses touretzky advances neural information processing systems mateo morgan kaufmann boahen spatiotemporal sensitivity retina physical model memo california institute technology pasadena june delbruck analog vlsi adaptive logarithmic photoreceptor circuits memo california institute technology pasadena
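The adaptation property described in this paper (a logarithmic receptor whose operating point tracks the background, so that transient gain to contrast stays high over decades of intensity) can be illustrated with a simple discrete-time model. This is a behavioural sketch only, not the authors' circuit: the first-order adaptation state, the time constant `tau`, and the function name `adaptive_receptor` are all assumptions made for illustration.

```python
import math

def adaptive_receptor(intensities, tau=20.0, gain=1.0):
    """Toy adaptive logarithmic receptor.

    The output is the gain times the difference between the log intensity
    and a slowly adapting state that tracks the log background level.
    """
    state = math.log(intensities[0])       # start fully adapted to the first sample
    responses = []
    for intensity in intensities:
        log_i = math.log(intensity)
        responses.append(gain * (log_i - state))
        state += (log_i - state) / tau     # operating point drifts toward the input
    return responses

# The same 20% contrast step on backgrounds three decades apart.
dim = adaptive_receptor([1.0] * 50 + [1.2] * 500)
bright = adaptive_receptor([1000.0] * 50 + [1200.0] * 500)
```

In this toy model the transient response depends only on contrast (the ratio of new to old intensity), and the output returns toward zero once the receptor has adapted to the new level.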
3 adaptive range coding bruce rosen james distributed machine intelligence laboratory computer science department university california angeles angeles abstract paper examines class neuron based learning systems dynamic control rely adaptive range coding sensor inputs sensors assumed provide binary coded range vectors describe system state vectors input neuronlike processing elements output decisions generated neurons turn affect system state subsequently producing inputs reinforcement signals environment received intervals evaluated neural weights range boundaries determining output decisions altered goal maximizing future reinforcement environment preliminary experiments show promise adapting neural receptive fields learning dynamical control observed performance method exceeds earlier approaches adaptive range coding introduction major unsupervised learning control techniques barto barto albus albus priori selection region sizes range coding range coding principle generalizes inputs reduces computational storage overhead boundary partitioning determined priori ranges barto differ barto control task differ determination optimal adequate regions left additional task require system dynamics analyzed address problem move region boundaries adaptively progressively altering initial partitioning representation priori knowledge unlike previous work michie barto anderson fixed approach produces adaptive contract expand adaptation frequently active contract reducing number situations activated increasing neighboring regions receive input class selforganization discussed kohonen kohonen ritter resulting selforganizing mapping tend track environmental input probability density function adaptive range coding creates focusing mechanism resources distributed regional activity level resources allocated critical areas state space concentrated activity control decisions tuned dynamic shaping region boundaries achieved memory learning speed region boundaries finally determined 
solely environmental dynamics optimal priori ranges region specifications dimensional state space shown figures partitioned regions vertical lines shown heavy curve theoretical optimal control surface unknown priori state space weight region approximate dashed horizontal lines show learned weight values rosen respective weight values approximate true control surface weight regions weight state space weight state space figure region partition figure adapted region partition evenly partitioned space produces weights shown figure figure shows regions boundaries adjusted final weight values weights reflect true control surface respective regions adaptive partitioning represent ideal surface smaller squared error adaptive range coding rule general dimensional control problem adaptive range boundaries shape region change initial dimensional dimensional shape determined current activation state average activity heuristic adaptive range coding move region vertex current activation state reinforcement equation adjusts region boundary adapted part weight formula kohonens topological mapping kohonen region consists vertices describing regions boundaries move current state activity depending reinforcement gain reinforcement error alter weight region gaussian difference gaussians function adaptive range coding simulation results experiments expected reinforcement aseace system barto simple pole balancing figure chosen cartpole balancing task barto time step chosen large seconds initial region boundaries chosen parameters identical barto impulse impulse left figure pole balancing task standard aseace adaptive range coding algorithms compared task hundred runs algorithm performed consisted sequence trials trial counted number time steps pole fell pole time steps trial considered successful terminated terminated trials pole successfully balanced successive trials assumed successive trials systems weights regions stabilized region weights initialized start adaptive range coding 
runs updated vertex state positions determined factors difference vertex current state expected reinforcement gain gaussian served decay function modulate vertex movements current state vertex differences served function input parameters outputs rosen increasing inputs standard deviation gaussian shaped decay function magnitude position vertex movement modulated reinforcement moves vertex form current state gain parameter user parameter values initially chosen arbitrarily experiments parameters fine tuned optimized figure shows results aseace adaptive range coding experiments runs trials differed random number generator seed runs trials standard aseace adaptive range coding algorithm random number seed parameters identical systems adaptive range coding region boundaries shifted accordance successes success adaptive critic element associative adaptive search range element coding figure comparison aseace adaptive range coding algorithm adaptive range coding simulated runs algorithm successful runs aseace algorithm runs successful adaptive range coding algorithm runs successful test showed performance sets statistically figure shows comparison average performance values aseace adaptive range coding runs pole balancing time shown function number learning trials experienced pole balancing average performances time aseace trial number figure comparison aseace adaptive range coding learning rates cart pole task pole balancing time shown function learning trials results averaged runs disparity times algorithms comparatively large number failures aseace system statistical analysis significant difference learning rates performance levels runs categories leading adaptive range coding lead rosen behavior minimum area state space system explore succeed conclusion research shown neuronlike elements adjustable regions dynamically create topological effect maps reflecting control laws dynamic systems results examples presented adaptive range coding effective earlier static region 
approaches control complex systems unknown dynamics references albus brains behavior robotics mcgrawhill books anderson feature generation selection layered network reinforcement learning elements initial experiments technical report coins amherst university massachusetts department computer information science barto sutton anderson neuronlike elements solve difficult learning control problems coins tech amherst university massachusetts department computer information science barto sutton anderson neuronlike elements solve difficult learning control problems ieee transactions systems cybernetics kohonen associative memory york springerverlag michie machine edinburgh intelligence ritter schulten topology mappings learning motor tasks denker neural networks computing snowbird ritter schulten extending kohonens organizing mapping algorithm learn ballistic movements neural computers springerverlag
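The boundary adaptation rule described in this paper moves a region vertex relative to the current state, with the step modulated by reinforcement, a gain parameter, and a Gaussian decay in the vertex-to-state distance. A one-dimensional sketch of such a rule follows. The exact equation is not recoverable from the text here, so the functional form, the sign convention (positive reinforcement pulls the vertex toward the state), and the name `update_vertex` are assumptions.

```python
import math

def update_vertex(vertex, state, reinforcement, gain=0.5, sigma=1.0):
    """Move one region boundary vertex in a 1-D state space.

    The step is proportional to the vertex-to-state difference, scaled by the
    reinforcement and a gain, and damped by a Gaussian of that difference so
    that vertices far from the current state barely move.
    """
    diff = state - vertex
    decay = math.exp(-0.5 * (diff / sigma) ** 2)
    return vertex + gain * reinforcement * decay * diff
```

With these assumed defaults, a vertex at 0 and a state at 1 with unit reinforcement moves the vertex part of the way toward the state, zero reinforcement leaves it in place, and a vertex ten standard deviations away is effectively unaffected.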
11 learning nonlinear dynamical systems algorithm zoubin ghahramani gatsby computational neuroscience unit university college london london httpwww abstract expectationmaximization algorithm iterative maximum likelihood parameter estimation data sets missing hidden variables applied system identification linear stochastic statespace models state variables hidden observer state parameters model estimated present generalization algorithm parameter estimation nonlinear dynamical systems expectation step makes extended kalman smoothing estimate state maximization step parameters uncertain state estimates general nonlinear maximization step requires integrating uncertainty states gaussian radial basis function approximators model nonlinearities integrals tractable maximization step solved systems linear equations stochastic nonlinear dynamical systems examine inference learning discretetime dynamical systems hidden state inputs outputs state evolves stationary nonlinear dynamics driven inputs additive noise characters indices denote vectors matrices represented characters zeromean gaussian noise covariance outputs linearly related states inputs zeromean gaussian noise covariance vectorvalued nonlinearities assumed differentiable arbitrary models kind examined decades notably nonlinear statespace models form modern systems control engineering paper examine models framework probabilistic graphical models derive learning algorithm based exception knowledge paper addressing learning stochastic nonlinear dynamical systems kind framework algorithm classical approach system identification treats parameters hidden variables applies extended kalman filtering algorithm section nonlinear system state vector augmented parameters approach inherently online important applications estimate covariance parameters time step contrast algorithm present batch algorithm attempt estimate covariance parameters important advantages algorithm classical approach algorithm straightforward principled
method missing inputs outputs generalizes readily complex models combinations discrete realvalued hidden variables formulate mixture nonlinear dynamical systems difficult prove analyze stability classical online approach algorithm attempting maximize likelihood acts lyapunov function stable learning sections describe basic components learning algorithm expectation step algorithm infer conditional distribution hidden states extended kalman smoothing section maximization step discuss general case section describe case nonlinearities represented gaussian radial basis function networks section extended kalman smoothing system equations infer hidden states history observed inputs outputs quantity heart inference problem conditional density captures fact system stochastic inferences uncertain gaussian noise assumption restrictive nonlinear systems linear systems nonlinearity generate nongaussian state noise authors aware tresp volume applied essentially model method multilayer perceptrons approximate nonlinearities requires sampling hidden states gaussian radial basis functions rbfs model nonlinearities analytically sampling section important extended kalman algorithm estimate parameters hidden states estimate hidden state part step linear dynamical systems gaussian state evolution observation noises conditional density gaussian recursive algorithm computing covariance kalman kalman smoothing directly analogous forwardbackward algorithm computing conditional hidden state distribution hidden markov model special case belief propagation algorithm nonlinear systems conditional density general nongaussian fact complex multiple approaches exist inferring hidden state distribution nonlinear systems including sampling methods variational approximations focus paper classic approach engineering extended kalman smoothing extended kalman smoothing simply applies kalman smoothing local linearization nonlinear system point derivatives vectorvalued functions define
matrices dynamics linearized kalman filter state estimate time output equation similarly linearized prior distribution hidden state gaussian linearized system conditional distribution hidden state time history inputs outputs gaussian kalman smoothing linearized system infer conditional distribution figure left panel learning step algorithm parameters observed inputs outputs conditional distributions hidden states model parameters define nonlinearities noise covariances complications arise step computationally sible fully reestimate represented neural network single full step lengthy training procedure backpropagation conjugate gradients optimization method alter partial steps consisting gradient steps complication trained uncertain state estimates output algorithm fitting takes inputs outputs conditional density estimated gaussian data points mixture gaussians inputoutput space gaussian data integrating type noise nontrivial form simple inefficient approach bypass problem draw large sample gaussian uncertain data samples usual similar situation occurs section show choosing gaussian radial basis functions model complications vanish forward part kalman smoother kalman filter ghahramani fitting radial basis functions gaussian present general formulation network clear special forms nonlinear mapping input vectors output vector zeromean gaussian noise variable covariance form represented parameters coefficients rbfs matrices multiplying inputs output bias vector assumed gaussian center width covariance matrix goal model data complication data form mixture gaussian distributions show analytically integrate mixture distribution model assume data rewrite slightly notation brackets denote expectation defining objective written observe samples variables paired gaussian data gaussian covariance matrix parameters likelihood single data point model const maximum likelihood mixture gaussian data obtained minimizing integrated quadratic form learning nonlinear dynamics taking 
derivatives respect setting linear equations solve words expectations brackets optimal parameters solved linear equations appendix show expectations computed analytically derivation intuition simple gaussian rbfs multiply gaussian densities form unnormalized gaussians expectations gaussians easy compute fitting algorithm illustrated panel figure gaussian evidence inputs outputs input dimension figure steps algorithm left panel shows information extended kalman smoothing hidden state distribution estep panel illustrates regression technique employed mstep mixture gaussian densities required gaussian networks solved analytically dashed line shows regular centres gaussian densities solid line shows analytic covariance information dotted lines show support kernels results tested algorithm learn dynamics nonlinear system observing inputs outputs system consisted single input state output variable time relation state time step tanh nonlinearity sample outputs system response white noise shown figure left panel initialized nonlinear model linear dynamical model trained turn initialized variant factor analysis model rbfs uniformly spaced range automatically determined density points initialization algorithm discovered sigmoid nonlinearity dynamics iterations figure middle panels experiments determine practical method real domains figure left data training half testing rest consists time series inputs outputs middle representative plots likelihood iterations linear dynamical systems dashed line nonlinear dynamical systems trained paper solid line note actual likelihood nonlinear dynamical systems generally computed analytically shown approximate likelihood computed solid curve initialization linear dynamics ends nonlinearity starts learned means gaussian posteriors computed dots sigmoid nonlinearity dashed line nonlinearity learned algorithm point algorithm observe pairs inferred inputs outputs current model parameters
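The intuition stated in this section, that a Gaussian RBF multiplied by a Gaussian density is an unnormalized Gaussian and so has an easy expectation, can be checked numerically in one dimension. The sketch below assumes an unnormalized kernel exp(-(x - c)^2 / (2 s^2)) and a state distributed as N(m, v). The paper's own kernel may carry a normalizing constant, and the function names are illustrative.

```python
import math

def expected_rbf(mean, var, center, width_var):
    """Closed-form E[exp(-(x - center)^2 / (2 width_var))] for x ~ N(mean, var)."""
    total = width_var + var
    return math.sqrt(width_var / total) * math.exp(-0.5 * (mean - center) ** 2 / total)

def expected_rbf_numeric(mean, var, center, width_var, n=20001, span=10.0):
    """The same expectation by trapezoidal integration over mean +/- span std devs."""
    sd = math.sqrt(var)
    lo = mean - span * sd
    step = 2.0 * span * sd / (n - 1)
    total = 0.0
    for k in range(n):
        x = lo + k * step
        density = math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)
        kernel = math.exp(-0.5 * (x - center) ** 2 / width_var)
        weight = 0.5 if k in (0, n - 1) else 1.0   # trapezoid end-point weights
        total += weight * density * kernel
    return total * step
```

The closed form depends on the state uncertainty only through the sum of the two variances, which is why uncertain state estimates can be handled without sampling under these assumptions.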
discussion paper brings classic algorithms statistics systems engineering address learning stochastic nonlinear dynamical systems shown pairing extended kalman smoothing algorithm state estimation estep radial basis function learning model permits analytic solution mstep algorithm capable learning nonlinear dynamical model data side effect derived algorithm training radial basis function network data form mixture gaussians initial approach potential limitations mstep presented modify centres widths kernels compute expectations required change centres widths requires resort partial mstep dimensional state spaces space kernels feasible strategy exponentially rbfs high dimensions training slow initialized poorly understanding hidden variable models related devise initialization heuristics model nested learned simple linear dynamical system turn initialized variant factor analysis method presented learns data assumes stationary dynamics recently extended handle online learning nonstationary dynamics belief network literature recently dominated methods approximate inference markov chain monte carlo variational approximations knowledge paper instance extended kalman smoothing perform approximate inference step theoretical guarantees variational methods gained wide acceptance estimation control method inference nonlinear dynamical systems exploring generalizations method learning nonlinear multilayer belief networks acknowledgements acknowledge support ontario gatsby charitable fund supported part center neuromorphic systems engineering nserc canada award expectations required rbfs expectations compute starting easier equation depend kernel observe multiply gaussian kernel equation gaussian density covariance extra constant lack normalization evaluate expectations finally references tresp fisher scoring mixture modes approach inference learning nonlinear state space models volume press dempster laird rubin maximum likelihood incomplete data
algorithm royal statistical society series jordan ghahramani jaakkola saul introduction variational methods graphical models machine learning kalman results linear filtering prediction journal basic engineering ljung theory practice recursive identification press cambridge moody darken fast learning networks locallytuned processing units neural computation neal probabilistic inference markov chain monte carlo methods technical report linear smoothing problem ieee transactions automatic control approach time series smoothing forecasting algorithm time series analysis
7 stochastic dynamics threestate neural networks computer science laboratory tokyo japan jack cowan mathematics neurology university chicago chicago abstract present analysis stochastic neurodynamics neural network composed threestate neurons master equation outerproduct representation equation employed representation extension analysis threestate neurons easily performed apply formalism approximation schemes threestate network compare results monte carlo simulations introduction studies single neurons networks influence noise item neural network modelling analogy spin systems finite produced important results networks twostate neurons studies networks threestate neurons limited master equation intro duced cowan study stochastic neural networks equation formalism quantization classical systems study networks twostate neurons cowan paper master equation outerproduct representation operators extend previ analysis networks threestate neurons hierarchy moment equations networks derived approximation schemes obtain equa jack cowan figure transition rates threestate neuron tions macroscopic activities model networks compare behavior solutions equations monte carlo simulations basic neural model introduce network master equation network cowan neurons site site assumed cycle states quiescent activated refractory labelled transitions functions neural input current assume smoothly increasing functions input current denoted transition rates defined constants resulting stochastic transition scheme shown figure assume transition rates depend current state network past states neural state transitions asynchronous assumption essential master equation description model represent state neuron threedimensional basis notation correspond standard vector notation define product states states configurations network represented direct product space neuron probability finding network state time introduce neural state vector neurons network stochastic dynamics threestate neural networks 
network states definitions write master equation network transition rates shown figure outerproduct representations sakurai note relations master equation takes form evolution equation network average number connections neuron weight neuron weights normalized respect average number connections neuron master equation introduced cowan matrices cowan note outerproduct representation extend description threestate neurons including basis vector analogy analysis twostate neurons introduce state vector product direct product parameters introduce point moments probability neuron active quiescent refractory time define multiple moment probability neuron active neuron quiescent neuron refractory time shown jack cowan hierarchy moment equations obtain equation motion moments typical case problems obtain analogue hierarchy equations definition moments master equation state vector show hierarchy order note parameters eliminated note equations coupled higher orders leads approximation schemes terminate hierarchy order introduce moment level approximation schemes simplicity special case linear equal moment field approximation leads stochastic dynamics threestate neural networks obtain moment approximation note moment dynamics obtained approximation differs obtained moment approximation section briefly examine difference comparing approximations monte carlo simulations jack cowan comparison simulations section compare moment approximations monte carlo simulation dimensional ring threestate neurons studied previous publication cowan twostate neurons shown threestate neuron ring interacts neighbors precisely operator define dynamical variables interest network moment approximation moment approximation stochastic dynamics threestate neural networks monte carlo simulations ring neurons performed compared moment approximation predictions fixed parameters figure comparison monte carlo simulations dots moment dashed line moment solid line approximations state case fraction total active 
refractory state variables graph labeled values varied sampled numerical dynamics parameters comparisons shown figure time dependence total number active refractory state variables improvement moment level approximation simulations parameter ranges remain explored conclusion introduced neural network master equation outerproduct representation representation extension threestate neurons transparent advantage natural extension analyse threestate networks calculations involved obtained results indicating moment level approximation accurate moment level approximation note twostate case moment level approximation produces activation simulation analytical theoretical investigations needed fully dynamics threestate networks master equation introduced acknowledgements work supported part robert fellowship university chicago part grant department office naval research references cowan stochastic neurodynamics advances neural information processing systems touretzky lippmann moody morgan kaufmann publishers mateo quantization representation classical phys math stochastic theory reactions phys math methods identical classical objects information processing threestate neural networks stat phys cowan approach stochastic neurodynamics phys cowan diagrams stochastic neurodynamics proceedings australian conference neural networks sakurai modern quantum mechanics menlo park
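The Monte Carlo comparison described for the ring of three-state neurons can be sketched roughly as follows. This is a minimal illustration, not the authors' simulation code: the sigmoid form of the activation rate, the values of alpha, beta and gamma, the weight and threshold, and the random-sequential update scheme (pick one neuron at a time and accept its transition with probability rate / rate_max) are all assumptions made here. Only the cyclic state order quiescent, activated, refractory, the input-dependent activation rate, the constant remaining rates and the nearest-neighbour ring come from the text.

```python
import math
import random

QUIESCENT, ACTIVE, REFRACTORY = 0, 1, 2


def activation_rate(n_active_neighbours, alpha=1.0, weight=2.0, threshold=1.0):
    """Quiescent -> active rate: a smoothly increasing (sigmoid) function of the input.

    The sigmoid and its parameters are assumptions; the rate never exceeds alpha.
    """
    return alpha / (1.0 + math.exp(-(weight * n_active_neighbours - threshold)))


def simulate_ring(n_neurons=100, n_events=20000, alpha=1.0, beta=0.5, gamma=0.2, seed=0):
    """Asynchronous dynamics on a ring; returns fractions [quiescent, active, refractory]."""
    rng = random.Random(seed)
    state = [rng.choice((QUIESCENT, ACTIVE, REFRACTORY)) for _ in range(n_neurons)]
    rate_max = max(alpha, beta, gamma)
    for _ in range(n_events):
        i = rng.randrange(n_neurons)  # one neuron is considered per event
        if state[i] == QUIESCENT:
            left = state[(i - 1) % n_neurons] == ACTIVE
            right = state[(i + 1) % n_neurons] == ACTIVE
            rate = activation_rate(left + right, alpha=alpha)
        elif state[i] == ACTIVE:
            rate = beta  # active -> refractory, constant rate
        else:
            rate = gamma  # refractory -> quiescent, constant rate
        if rng.random() < rate / rate_max:
            state[i] = (state[i] + 1) % 3  # advance one step around the cycle
    return [state.count(s) / n_neurons for s in (QUIESCENT, ACTIVE, REFRACTORY)]


if __name__ == "__main__":
    print(simulate_ring())
```

Averaging the returned fractions over many seeds gives the kind of macroscopic activity that the moment approximations are compared against.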
1 mapping classifier systems neural networks lawrence davis laboratories systems technologies corporation street cambridge january abstract classifier systems machine learning systems incorporating genetic algorithm learning mechanism respond inputs neural networks respond structure representation learning mechanisms differ employed neural network sorts domains result conclude types machine learning intrinsically papers prove classifier systems neural networks equivalent paper half equivalence demonstrated description transformation procedure classifier systems neural networks isomorphic behavior paradigms employed neural network researchers required order make transformation work noted discussed paper concludes discussion practical results comments introduction classifier systems machine learning systems developed john holland recently members genetic algorithm research community classifier systems genetic algorithms algorithms optimization learning genetic algorithms employ techniques inspired process biological evolution order evolve paper discussions rich sutton williams wilson david members boston area research group genetic algorithms inductive networks individuals solutions problems optimizing function traversing maze explanation genetic algorithms reader referred goldberg classifier systems receive messages external source inputs genetic algorithm learn produce responses internal interaction external source paper papers exploring question formal relationship classifier systems neural networks employed sorts algorithms distinct procedure translating operation neural networks isomorphic classifier systems technique include conversion neural network learning procedure classifier system framework appears technique support conversion conjecture sorts machine learning systems employ learning techniques relationship result suggests classifier systems neural networks reverse conclusion suggested consideration inputs sort learning algorithm processes viewed black
boxes mechanisms learning receive inputs carry procedures produce outputs class inputs traditionally processed classifier systems class strings fixed length subset class inputs traditionally processed neural networks appears classifier systems operate subset inputs neural networks process viewed mechanisms modify behavior fact correct translate classifier systems neural networks preserving learning behavior translate neural networks classifier systems preserving learning behavior order sort algorithm made paper deals translation classifier systems neural networks neural networks required order translation place reverse translation techniques treated davis sections description classifier systems description transformation operator discussions proof comments issues raised proof conclusions classifier systems classifier system operates context environment sends messages system reinforcement based behavior displays classifier system components message list population entities called classifiers message message list composed bits mapping classifier systems neural networks pointer source messages generated environment classifier classifier population classifiers components match made characters dont care message made characters strength description classifier system population production rules attempt match condition message list classifying input post message message list potentially affecting environment classifiers reinforcement environment classifier system modify strengths classifiers periodically genetic algorithm invoked create classifiers replace members classifier explanation classifier systems potential machine learning systems formal properties reader referred holland processing stages precisely classifier system operates cycling fixed list procedures order procedures message list processing clear message list post environmental messages message list post messages message list classifiers post previous cycle implement environmental reinforcement analyzing 
messages message list altering strength classifiers post previous cycle form determine classifiers match message message list classifier matches message match field matches message matches matches matches matching classifiers forms current implement subtracting portion strength classifier strength strength classifier classifiers messages matched prior step form post larger maximum post size choose classifiers stochastically post weighting proportion magnitude classifiers chosen post reproduction reproduction generally occur cycle occur steps carried create children parents crossover andor mutation choosing parents stochastically favoring strongest crossover mutation operators genetic algorithms strength child equal average strength parents note ways strength classifier transformation work analogous ways remove members classifier population children classifier population mapping classifiers classifier networks mapping operator describe maps classifier classifier network classifier network links environmental input units links classifier networks match post message units weights links leading match node leaving post node related fields match message lists classifier additional link added provide bias term match node note assumed environment message cycle modifications transformation operator accommodate multiple environmental messages final comments paper classifier system classifiers matching sending length construct isomorphic neural network composed classifier networks classifier construct classifier network composed match nodes post node message nodes match node environmental match node links inputs environment match nodes linked message post node classifier network reader referred figure transformation match node classifier network incoming links weights links derived applying transformation elements match field weight weight weight weight final link number links weight classifier match field network weights links leading match nodes classifier match field 
weights weights links message node classifier network equal element classifiers message field message field classifier weights links leading message nodes classifier network weights links classifier network node classifier network threshold function determine acti vation level match nodes thresholds nodes thresholds nodes threshold exceeded nodes activation level classifier network quantity called strength altered network processing cycle cycle processing classifier system maps cycle processing classifier networks message list processing compute activation level message node environment supplies reinforcement cycle divide reinforcement number post nodes active environment message preceding cycle quotient strength active post nodes classifier network message cycle environment environment nodes node node turn final environmental node environmental message turn environmental mapping classifier systems neural networks nodes form compute activation level match node classifier network compute activation level node classifier network classifier networks active node subtract fixed proportion strength classifier network amount strength networks connected active match node strength environment passes system form post larger maximum post size choose networks stochastically post weighting proportion magnitude networks chosen post viewed stochastic procedure reproduction cycle reproduction occur classifier system carry analog neural network create children parents crossover andor mutation choosing parents stochastically favoring strongest alphabet composed classifier alphabet operator applied final member match list write weights match links message links classifier networks match weights children choose networks stochastically chosen strength classifier network average strengths parents simple show classifier network match node match message cases classifier matched message cases original match character matched message link weight state node affect activation match node 
original match character message message matched link weight inspection weight final link match node threshold fact type link positive weight link weight connected active node match node activated finally link weight links connected node active effect turning node connected link weight match node inactive correspondence matching behavior verify classifier networks classifier system properties cycle processing classifier system classifier cases network active node assuming systems technique initialized classifier post cases network post finally parents chosen reproduction chosen classifier system children produced transformations classifier system parents systems isomorphic operation assuming random number generator davis classifier network strength classifier network strength message nodes post nodes match nodes environment input nodes figure result classifier system classifiers neural network classifier match message strength match message strength mapping classifier systems neural networks concluding comments transformation procedure classifier system neural network operates points raised techniques accomplish mapping closing excess complexity classifier networks shown fact eliminate match nodes links determine classifier network matches message classifier network system introduce link directly post node classifier network post node network match nodes environment long predict messages environment post message nodes long messages environment incoming links eliminated simplifications introduced extensions discussed require complexity current architecture genetic algorithm side classifier system considered extremely simple extensions classifier system researchers handled expanded mapping procedures modifications architecture classifier networks give indication modifications sample cases case environment produce multiple messages cycle handle multiple messages additional link added environmental match node weight match nodes threshold link latch match node 
additional match node links environment nodes added counting node attached architectural modifications cycle modified message matching cycle series carried message environment environmental message input environmental match node computes activation environmental match nodes active matched environmental message count nodes record matched classifier network paid classifier network messages matched number environmental messages matched recorded count node number messages matched finally weights written classifier networks links written match node connected count node sort complication bits bits passed message matched message sort mechanism implemented obvious fashion structure classifier network similar complications produced matching negation messages open question cases handled modifying architecture mapping operator found handled davis classifier networks sigmoid activation functions port hillclimbing techniques recurrent networks strict feedforward networks fact carry transformations affect behavior researchers field point greater length paper conclusion techniques neural network domain mapping improve performance networks include tracking strength order guide learning process genetic operators modify network measurements order determine aspects network reproduction reader referred davis benefits gained employing techniques finally proof intended view proof proof suggest exciting ways learning techniques field approach successful application realworld problem characterized davis references richard michael back propagation classifier system preparation davis lawrence mapping neural networks classifier systems international conference genetic algorithms goldberg david genetic algorithms search optimization machine learning addison wesley holland john richard paul induction press david lawrence davis training feedforward neural works genetic algorithms submitted international joint conference artificial intelligence
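The claim that a match node with suitably chosen link weights and threshold fires exactly when the classifier's match field matches a message can be illustrated with a small check. The encoding below is an assumption chosen for the sketch (weight +1 for a match character 1, weight -1 for 0, weight 0 for the don't-care character #, and a threshold equal to the number of 1 characters); the exact weights and bias link used in the paper are not recoverable from this text, so this shows one encoding with the stated property rather than the paper's own.

```python
from itertools import product


def match_weights(match_field):
    """Return (weights, threshold) for a match node built from a match field.

    Assumed encoding: '1' -> +1, '0' -> -1, '#' (don't care) -> 0.
    """
    weight_for = {"1": 1, "0": -1, "#": 0}
    weights = [weight_for[ch] for ch in match_field]
    threshold = match_field.count("1")
    return weights, threshold


def match_node_active(match_field, message):
    """Threshold unit: active when the weighted sum of message bits reaches the threshold."""
    weights, threshold = match_weights(match_field)
    total = sum(w * int(bit) for w, bit in zip(weights, message))
    return total >= threshold


def classifier_matches(match_field, message):
    """Direct classifier-system rule: every position is '#' or equals the message bit."""
    return all(m == "#" or m == b for m, b in zip(match_field, message))


def check_equivalence(length=4):
    """Exhaustively compare the two rules for all match fields and messages of a given length."""
    for field in product("01#", repeat=length):
        for message in product("01", repeat=length):
            if match_node_active(field, message) != classifier_matches(field, message):
                return False
    return True


if __name__ == "__main__":
    print(check_equivalence(4))
```

The weighted sum can reach the threshold only if every 1 position sees a 1 and no 0 position sees a 1, which is the matching condition; don't-care positions contribute nothing either way.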
6 learning curves asymptotic values rate convergence cortes jackel solla vapnik john denker bell laboratories holmdel abstract training classifiers large databases computationally demanding desirable develop efficient procedures reliable prediction classifiers implementing task resources assigned promising candidates exploring classifier candidates propose practical principled predictive method practical avoids costly procedure training poor classifiers training principled theoretical foundation effectiveness proposed procedure demonstrated single multilayer networks introduction training classifiers large databases computationally demanding desirable develop efficient procedures reliable prediction classifiers implementing task describe practical principled predictive method procedure applies situations huge databases limited resources classifier selection poses problem training requires resources explosion classifier candidates training classifiers full database resources finding classifier suitable task requires search strategy figure test errors function size training classifiers classifier choice based test error size result inferior classifier choice full database naive solution the resource dilemma reduce the size database feasible train classifier candidates performance classifiers estimated independently chosen test training makes point classifier plot test error function size training naive search strategy classifier assumption relative ordering classifiers unchanged test error reduced size full database size assumption questionable easily result inferior classifier choice illustrated predictive method utilizes extrapolation medium sizes large sizes training based data points obtained sizes training intermediate size regime computational cost training change representation measured data points gain confidence extrapolation predictive method predictive method based simple modeling the learning curves
classifier learning curves expectation test training errors function training size expectation ways choosing training size typical learning curves shown the test error larger training error asymptotically reach common model errors large sizes training powerlaw decays learning curves asymptotic values rate convergence error test error training error training size figure learning curves typical classifier finite values training size test error larger training error asymptotically converge asymptotic error size training positive exponents expressions difference formed strain make assumption equation reduce expressions suggest representation difference test training errors training size resulting straight lines large sizes training constant straight line slope intersection difference shown assumption equal convergent terms crucial simplification model find experimentally classifiers approximation hold difference forms straight line from line extracted intersection weighted cortes jackel solla vapnik denker size figure validity powerlaw modeling test training errors difference errors function training size give straight lines constant straight line slope intersection difference give constant choice validity model tested numerous boolean classifiers linear decision surfaces experiments found good agreement model extract reliable estimates parameters needed model learning curves asymptotic power amplitude powerlaw decay shown left considered task separation handwritten digits digits problem unrealizable database classifier simple modeling test training errors equation assumed hold large sizes training appears valid intermediate sizes left predictive model suggested based observation illustrated left test training errors measured estimate straight lines extract approximate values parameters characterize learning curves resulting extrapolate learning curves full size database algorithm predictive method measure intermediate sizes training plot strain strain versus
estimate straight lines extract asymptotic amplitude exponent extrapolate learning curves full size database learning curves asymptotic values rate convergence error error training error training size points prediction predicted learning curves figure left test model dimensional boolean classifier trained squared error difference test training errors shown function normalized training size base point standard deviation choices training size straight line decay shown reference prediction learning curves dimensional boolean classifier trained minimizing squared error measured errors training size proposed straight lines plot parameters characterize learning curves extracted extrapolation prediction boolean classifier linear decision surface illustrated prediction excellent type classifiers difference test training errors converge quickly straight lines linear decision surfaces general adequate applications usefulness predictive method proposed judged formance sophisticated multilayer networks demonstrates validity model fullyconnected multilayer network operating nonlinear regime implement unrealizable digit recognition task intermediate sizes training difference test training errors observed follow straight lines predictive method finally tested sparsely connected multilayer works left shows test training errors networks trained recognition handwritten digits network termed commonly referred network termed modification additional feature maps full size database patterns cortes jackel solla vapnik denker test figure test model fullyconnected network difference test training error shown function normalized training size point standard deviation choices training size mixture nist training test sets training patterns obvious network perform network trained full database quantify expected improvement predictive method good quantitative estimate networks test error patterns decide weeks training devoted architecture based datapoints network result values parameters 
determine extrapolate learning curves network full size database illustrated predicted test error full size database half test error architecture strongly suggest performing training full database result full training good agreement predicted measured values illustrates power applicability predictive method proposed applications theoretical foundation proposed predictive method based powerlaw modeling learning curves heuristic fair amount theoretical work framework statistical mechanics compute learning curves simple classifiers implementing unrealizable rules nonzero asymptotic error assumption theoretical approach number weights network large institute standards technology special database learning curves asymptotic values rate convergence error network network figure error training size network network predicted training size left test circles training triangles errors networks work commonly referred network termed modification additional feature maps full size database patterns mixture nist training test test circles training triangles errors network figure shows predicted values learning curves range train patterns network measured values patterns statistical mechanical calculations support symmetric powerlaw decay expected test training errors common asymptotic power laws describe behavior large regime exponent falls interval numerical observations modeling test training errors agreement theoretical predictions observed correlation exponent error accounted theoretical models considered shows plot exponent versus asymptotic error evaluated tasks appears data difficult target rule smaller exponent slower learning larger generalization error intermediate training sizes cases combined effect larger asymptotic error slower convergence numerical results classi smaller larger input dimension support explanation correlation finite size input dimension classifier summary paper propose practical principled method predicting suit ability classifiers trained large 
databases procedure eliminate poor classifiers early stage training procedure intelligent computational resources method based simple modeling expected training test errors expected valid large sizes training model measures assumed follow powerlaw decays common asymptotic error exponent amplitude characterizing powerlaw convergence validity model tested classifiers linear linear decision surfaces free parameters model extracted data points obtained medium sizes training extrapolation good estimates test error large size training numerical studies learning curves revealed correlation exponent powerlaw decay asymptotic error rate correlation accounted existing theoretical models subject continuing research figure exponent extracted powerlaw decay function asymptotic error tasks tasks characterized asymptotic error changed tuning strength constraint norm weights classifier references boser denker henderson howard hubbard jackel handwritten digit recognition backpropagation network advances neural information processing systems volume pages morgan kaufmann seung sompolinsky tishby statistical mechanics learning examples physical review
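The predictive method summarised above (measure errors at intermediate training sizes, fit straight lines, extract the asymptotic error, amplitude and exponent, then extrapolate) can be sketched as follows. The explicit model form is an assumption, since the equations did not survive in this text: test error a + b / l^alpha and training error a - b / l^alpha, with l the training set size, so that the sum of the two errors is the constant 2a and the logarithm of their difference is a straight line in log l with slope -alpha. The function names and the synthetic numbers in the usage section are hypothetical.

```python
import math


def fit_power_law(sizes, test_errors, train_errors):
    """Estimate (a, b, alpha) from errors measured at intermediate training sizes.

    a comes from the average of (test + train) / 2; alpha and b come from a
    least-squares line through log(test - train) versus log(size).
    """
    n = len(sizes)
    a = sum(t + r for t, r in zip(test_errors, train_errors)) / (2.0 * n)
    xs = [math.log(l) for l in sizes]
    ys = [math.log(t - r) for t, r in zip(test_errors, train_errors)]
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    alpha = -slope
    b = math.exp(intercept) / 2.0  # the difference equals 2 * b / size**alpha
    return a, b, alpha


def predict_errors(size, a, b, alpha):
    """Extrapolated (test error, training error) at a given training size."""
    gap = b / size ** alpha
    return a + gap, a - gap


if __name__ == "__main__":
    # Synthetic measurements generated from the assumed model, for illustration only.
    true_a, true_b, true_alpha = 0.05, 2.0, 0.7
    sizes = [1000, 2000, 4000, 8000]
    test = [true_a + true_b / l ** true_alpha for l in sizes]
    train = [true_a - true_b / l ** true_alpha for l in sizes]
    a, b, alpha = fit_power_law(sizes, test, train)
    print(a, b, alpha, predict_errors(60000, a, b, alpha))
```

With real measurements the points scatter around the lines, so in practice each point would be an average over several choices of training set, as the figures in the text describe.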
1 genesis system simulating neural networks wilson john james bower division biology california institute technology pasadena abstract developed graphically oriented general purpose simulation system facilitate modeling neural networks simulator implemented unix designed support simulations levels detail specifically intended applied network modeling simulation detailed realistic biologically based models examples current models developed system include mammalian olfactory bulb cortex invertebrate central pattern generators abstract connectionist simulations introduction recently dramatic increase interest exploring computational properties networks parallel distributed processing elements rumelhart referred neural networks anderson current research involves numerical simulations types networks anderson touretzky years significant increase interest similar computer simulation techniques study structure function biological neural networks effort attempt brain objective understanding functional organization complicated networks bower simulations systems range detailed reconstructions single neurons components single neurons simulations large networks complex neurons koch area research benefit exposure large range neural network simulations simulation package capable implementing varied types network models facilitate interaction wilson bower design features simulator built genesis general network simulation system graphical interface output display utility provide standardized flexible means constructing neural network simulations making minimal assumptions actual structure neural components system capable growing users incorporating code describe specific features system device independence entire system designed unix version maximum code developed workstations computers supporting unix addition developing parallel implementation simulation system nelson modular design design simulator interface based approach simulations constructed modules receive inputs perform 
calculations generate outputs figs approach central generality flexibility system user easily features modification base code interactive specification control network specification control high level graphical tools network specification language graphics interface highest user level interaction consists number tools user suit simulation graphical interface user display control adjust parameters simulations network specification language developed network modeling represents basic level interaction language consists simulator interface functions executed text files storing command sequences scripts language arithmetic operations program control functions conditional statements figures demonstrate functions simulator interface consist module graphical tools simulator base code provide routines modules construct specific simulations base code common control support routines entire system genesis system simulating neural networks graphics interface files genesis command language genesis figure levels interaction simulator constructing simulations step genesis involves selecting linking modules simulation additional commands language establish network graphical interface module classes modules genesis divided computational modules communications modules graphical modules instances computational modules called elements central components simulations performing numerical calculations elements communicate ways links connections links passing data elements time delay computation performed data links serve large number elements single computational unit link elements form neuron connections hand interconnect computational units simulated communication channels incorporate time delays perform transformations data transmitted axons graphical modules called construct interface modules issue commands respond allowing interactive access simulator structures functions wilson bower hierarchical organization order track structure simulation elements organized tree hierarchy 
similar structure unix tree structure explicitly represent pattern links connections elements simply tool organizing complex groups elements simulation simulation types modules process structuring network simulation graphical interface describe construction simple biological neural simulation consists neurons neuron passive dendritic compartment active cell body axonal output synaptic input dendrite axon neuron connects synaptic input figure shows basic structure model implemented genesis model synapse channels simulator interface graphics modules computational modules communications modules simulation simulator figure stages constructing simulation genesis system simulating neural networks dendrite network dendrite axon synapse element figure implementation neuron model genesis schematic gram modeled neurons cell simple model dendritic compartment active output axon synaptic input dendrite cell ionic channels cell body hierarchical representation components simulation maintained genesis neuron referred representation functional links basic components neuron sample interface control display created wilson bower dendritic compartments cell body axon treated separate computational elements links elements share information channel access membrane voltage figure shows portion construct simulation create types elements assign names create create create dendrite create synapse establish functional links elements link dendrite link dendrite parameters elements dendrite capacitance make copies entire element subtrees copy establish connections elements connect graph monitor element variable graph potential make control panel control control default default figure sample commands constructing simulation simulator specifications memory requirements genesis genesis consists lines simulator code similar amount graphics code written rough estimate amount additional memory simulation calculated sizes number modules simulation typically elements bytes connections messages genesis 
system simulating neural networks performance efficiency genesis system highly simulation specific briefly specific case sophisticated biologically based simulation implemented genesis model piriform olfactory cortex wilson wilson bower wilson bower simulation consists neurons types neuron compartments compartment channels simulation cells runs time step models implemented genesis list projects completed genesis includes approximately simulations include models olfactory bulb inferior olive bower motor circuit invertebrate ryckebusch built students explore compartmental biological models hodgkin huxley hopfield networks hopfield genesis genesis made cost distribution interested users modules linked simulator extend system users encouraged support continuing development system sending modules develop caltech reviewed system genesis support hope users send completed published simulations genesis data base provide opportunity observe behavior simulation hand current modules full simulations maintained electronic mail system genes acknowledgments mark nelson invaluable assistance development system specifically suggestions content manuscript recognize dave christof koch caltech students students summer methods computational neuroscience contributions creation evolution genesis mutually exclusive research supported contract corporation caltech fund development fund joseph foundation wilson bower references anderson neural information processing systems american institute physics york wilson bower integration computer simulations multiunit recording olfactory system neurosci abstr bower reverse engineering nervous system anatomical physiological computer based approach introduction neural electronic networks davis editors academic press press hodgkin huxley quantitative description membrane current application conduction excitation nerve lond hopfield neural networks physical systems emergent collective computational abilities proc natl acad koch segev methods neuronal
modeling synapses networks press cambridge press bower structural simulation inferior olivary nucleus neurosci abstr nelson bower simulating neurons neuronal networks parallel computers methods neuronal modeling synapses networks koch segev editors press cambridge press ryckebusch mead bower modeling central pattern generator software hardware moss cmos volume rumelhart mcclelland research group parallel distributed processing press cambridge advances neural network information processing systems morgan kaufmann publishers mateo california wilson bower simulation large-scale neuronal networks methods neuronal modeling synapses networks koch segev editors press cambridge press wilson bower computer simulation olfactory cortex functional implications storage retrieval olfactory information neural information processing systems anderson editor published press york wilson bower haberly computer simulation piriform cortex neurosci abstr part structured networks
10 training neural networks small squared errors department mathematics yale university abstract demonstrate problem training neural networks small average squared error computationally intractable data points input vectors real outputs work class neural networks relative error occurs data prove classes neural networks achieving relative error smaller fixed positive threshold independent size data nphard introduction data input vectors real outputs call points data points training problem neural networks find network class fixed number nodes layers fits data small error describe problem details class neural networks metric norm associate error vector depends data prefer notation avoid difficulty norm shows network fits data norm denote smallest error achieved network context training problem find positive number advance depend size data call relative error norm chosen nature training process common norms norm interpolation problem norm square error problem referred error training process goal paper show achieving small error nphard work norm question great importance data advance find algorithm solve training problem formulated algorithm polynomial time polynomial size input question closely related problem learning neural networks polynomial time input algorithm data size means number bits required write question find algorithm produces question answer general paper investigate question achieve arbitrarily small relative error polynomial algorithms purpose give negative answer question question posed jones yale crucial point dealing norm important statistical point view investigation inspired works show negative results norm case definition positive number threshold class neural networks training problem networks relative error nphard computationally infeasible order provide negative answer question show existence thresholds independent size data classes networks positive integer positive vectors real numbers positive clear class reason distinguish cases proof easy
present important ideas class functions sigmoid functions satisfy conditions details main theorem classes absolute constant positive thresholds training neural networks small squared errors class threshold form threshold form class threshold form statements absolute positive constant argument proof assume algorithm solves training problem class relative error properly chosen nphard problem construct data sufficiently small solution found constructed data input implies solution original nphard problem give lower bound assume algorithm polynomial proofs leading parameter dimension data inputs polynomial polynomial variable input data sets constructed polynomial size paper organized follow section discuss earlier results norm section display nphard results reduction section prove main theorem class mention method handle general cases conclude remarks open questions section section mention important corollary main theorem implies learning respect norm hard connection complexity training learning problems refer notation paper denotes unit hypercube number denotes vector length denotes origin half space complement number elements function order magnitude positive constants previous works case case interpolation problem considered authors classes networks authors investigate case perfect authors proved training networks step function nodes relative error nphard proof extended networks nodes logistic output nodes generalized result data rational inputs combining techniques analysis arguments jones showed training problem relative error networks monotone sigmoid nodes linear output node nphard npcomplete circumstances implies threshold sense definition class examined threshold weak decreasing result extended nodes case interesting compare results considered problem network training examples data exist weights network correct output training examples proved problem nphard network required produce correct output examples fact shown class networks data sets algorithm produce 
poorly networks data sets class result network hard train algorithms number nodes networks grows size data sense result independent size data proofs exploit techniques provided works crucial reduction blum rivest involves problem hard problems definition formula clause literals maximum number clauses satisfied truth assignment problem find truth assignment satisfies clauses theorem approximation problem hard small theorem finding truth assignment satisfies clauses nphard problem hard literal appears clauses clause literals denote class literals clause literal appears clauses theorem finding truth assignment satisfies clauses formula nphard optimal thresholds theorems computed recent results computer science space limitation matter edges collection subsets elements called vertices degree vertex number edges vertex assume edge vertices color vertices color blue edge vertices colors call maximum number edges achieve probabilistic argument easy show random edge probability prove theorem proof theorem constant finding edges nphard statement holds case degree proof follow reduction theorem exception vertex vertices degree edge vertices number edges training neural networks small squared errors unit vector edge maximum number edges denote edges edges data inputs makes difference means repeated times data resp similarly vectors blue verify function step fits data perfectly suppose step satisfies previous inequality implies ratio called misclassification ratio show ratio arbitrary small order avoid unnecessary floor symbols assume integer choose assume classified space consisting note denote note points misclassified color rule color arbitrarily blue blue proof claims claim edge left readers verify simple statement half degree claim close notice observe size data hand obtain yields choose theorem threshold class completes proof space limitation omit proofs classes refer describe roughly general method handle cases method consists steps extend data previous proof special 
points special points sufficiently high points choose special points properly fact points determine roughly behavior nodes general show nodes influence outputs data points problem basically reduces case nodes modifying previous proof achieve desired thresholds remarks open problems readers argue existence natural data points high avoid data points combinatorial trick proof section carried theorem prefer terminology theorem convenient standard theorem interesting listed approximation hard theorems remains open question determine order magnitude thresholds classes considered section technical reasons main theorem thresholds nodes involve dimension conjecture thresholds acknowledgement blum barron ideas discussions references approximation book chapter preprint blum rivest training neural network nphard neural networks blumer ehrenfeucht haussler warmuth learnability dimension journal association computing johnson computers guide theory francisco generalizing model neural learning applications tech santa cruz university california jones computational training sigmoidal neural networks preprint judd neural networks learning press polyhedral separability tech research center jose training neural networks small squared error manuscript
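The hardness argument in the paper above reduces from a maximum satisfiability problem: given a formula in clause form, find the largest number of clauses that one truth assignment can satisfy. As a minimal sketch of that starting problem only (not of the reduction to network training), the code below counts satisfied clauses and finds the maximum by brute force. The signed-integer clause encoding and the function names are assumptions made for illustration; the exhaustive search takes exponential time, which is consistent with the problem being hard in general.

```python
from itertools import product


def satisfied_count(clauses, assignment):
    """Count clauses with at least one true literal.

    A literal k > 0 means "variable k is true"; k < 0 means
    "variable |k| is false". `assignment` maps variable index -> bool.
    """
    return sum(
        any(assignment[abs(lit)] == (lit > 0) for lit in clause)
        for clause in clauses
    )


def max_satisfiable(clauses, num_vars):
    """Brute-force the maximum number of simultaneously satisfied clauses."""
    best = 0
    for bits in product([False, True], repeat=num_vars):
        assignment = {i + 1: b for i, b in enumerate(bits)}
        best = max(best, satisfied_count(clauses, assignment))
    return best


# Hypothetical formula: (x1) and (not x1) and (x1 or x2) and (not x2)
example_clauses = [(1,), (-1,), (1, 2), (-2,)]
```

Because the first two clauses contradict each other, no assignment satisfies all four clauses of `example_clauses`; the quantity of interest is how many can be satisfied at once.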
8 correlated neuronal response time scales mechanisms bair howard hughes medical inst center neural science washington room york dept neurobiology institute life sciences hebrew university jerusalem israel christof koch computation neural systems caltech pasadena abstract analyzed relationship correlated spike count peak crosscorrelation spike trains pairs recorded neurons previous study area macaque monkey conclude common input responsible creating peaks order milliseconds wide spike train responsible creating correlation spike count observed time scale trial argue common excitation inhibition play significant roles establishing correlation introduction previous study pairs neurons recorded single extracellular electrode found spike count seconds visual motion stimulation average correlation coefficient correlation significantly limit usefulness pooling increasingly large populations neurons correlated spike count between neurons principle occur correlated excitability cells normal biological electrode induced correlation time scale alternatively attentional priming effects higher areas change cells time scale experimental trial common input order milliseconds correlation spike count section determines time scale neurons correlated analyzing relationship peak spike train correlation spike counts construct call trial section examines temporal structure indicative correlated suppression firing inhibition contribute spike count correlation time scale correlation time scale single trial correlation spike counts neurons recorded identical stimuli computed correlation coefficient expected variance spike counts converted unity variance interpreted crosscorrelation spike counts trial resulting procedure shown pairs neurons distinguish cases shown correlation broken longterm component average computed gaussian window standard deviation trials surrounding short term component difference pairs neurons monkeys average significantly
similar correlation reported assumptions including time scale correlation trial duration estimated area spike train areas derivation omitted additional assumption spike trains individually poisson peak autocorrelation occurs definition correlation coefficient spike count estimated firing rates neurons area area spike train peak shown pair neurons taking area area msec good estimate shortterm shown addition strong correlation noisy measure standard deviation shown average fourth large conclude common input peaks spike train responsible correlation spike count previously reported figure normalized responses pairs neurons trial cross upper traces show spike counts trials order occurred spikes counted stimulus trials occurred average trials represents lower traces show trial pair cells left panel experiment lower left shows drift correlated neurons trials pair cells panel trial shows strong correlation simultaneous trials measured correlation coefficient trial occur long time scale left short time scale equal trial broken components short term long term text shortterm component minus weighted average surrounding times left figure spike train central peak frequency histogram widths shown inset cell pairs monkeys area central peak measured msec predict correlation plotted probability coincidence relative expected poisson processes measured firing rates short term figure area peak spike train yields prediction strongly correlated shortterm spike count correlation absence points lower corner plot cases pair cells strongly correlated peak spike train pairs neurons shortterm correlation peak msec range spike train correlated suppression doubt common excitatory input peaks shown results correlated spike count time scale trial observed correlated periods suppressed firing point inhibition contribution peaks correlated spike count
show response neuron coherent preferred null direction motion long interspike intervals isis gaps response preferred motion bursts response null motion database single neurons previous study occurrence gaps bursts symmetrical time prominent average msec onset substantial variations cell cell bair gaps roughly msec long consistent slow steady adaptation potassium currents observed current injection neocortical pyramidal neurons neurons shows spike trains simultaneously recorded neurons stimulated preferred direction motion longest gaps occur time assess correlation transform spike trains interval trains shown spike trains emphasizes presence long isis removes information precise occurrence times action potentials interval cross correlation pair interval trains computed averaged trials average shift predictor subtracted show thick lines pairs neurons pairs peaks standard errors level shift predictor peaks average centered msec msec width msec msec peaks long intervals trains short intervals long intervals defined accounted duration data longer short intervals note small fraction number isis spike train typically long intervals amount time short intervals data msec processed avoiding lack final interval longest intervals peaks pushed level noise thin lines action potentials serve rate periods long isis dominate peaks correlated gaps consistent common inhibition neurons local region cortex inhibition adds area spike train peaks form broader base shown data analyzed behaving animals gaps related small saccades degree figure response coherent preferred direction motion occasional long interspike intervals gaps suppressed response null direction motion interrupted bursts spikes simultaneous spike trains neurons show correlated gaps preferred direction response interval representation spike
trains interval peaks indicating gaps correlated text fixation window hypothesized suppression saccadic visual suppression operate pathways neuronal origin alternative hypothesis gaps bursts arise cortex intrinsic circuitry arranged opponent fashion conclusion common input central peaks order tens milliseconds wide spike train responsible causing correlation spike count time scale long trials longterm correlation exists average cell pairs represent source noise accurate measurement correlation area peak spike train window msec basis good prediction spike count correlation coefficient noisy measure correlation neurons correlated gaps observed response coherent preferred direction motion consistent common inhibition contributes area spike train peak correlation spike count correlation spike count important factor limit neuronal ensembles acknowledgements william newsome michael anthony movshon providing data recorded previous studies helpful discussion work funded office naval research force office scientific research supported hanson foundation howard hughes medical institute references correlation intrinsic firing patterns synaptic responses neurons mouse cortex bair analysis temporal structure spike trains visual cortical area thesis california institute technology newsome movshon analysis visual motion comparison neuronal psychophysical performance neurosci independent messages carried adjacent inferior temporal cortical neurons neurosci suppression contrasts sensitivity vision newsome correlated neuronal discharge rate implications psychophysical performance nature
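The trial-by-trial analysis in the paper above normalizes each neuron's spike counts to unit variance, computes the correlation coefficient between the two neurons, and separates a long-term component (a Gaussian-weighted average over surrounding trials) from the short-term remainder. The sketch below is one simplified reading of that procedure. The function names, the default window width, and the choice to smooth each neuron's normalized counts separately are assumptions for illustration, since the original equations did not survive extraction.

```python
import math


def zscore(counts):
    """Convert spike counts to zero mean and unit variance.

    Assumes the counts are not all identical (nonzero variance).
    """
    n = len(counts)
    mean = sum(counts) / n
    sd = math.sqrt(sum((c - mean) ** 2 for c in counts) / n)
    return [(c - mean) / sd for c in counts]


def spike_count_correlation(counts_a, counts_b):
    """Pearson correlation coefficient of two per-trial spike count lists."""
    za, zb = zscore(counts_a), zscore(counts_b)
    return sum(x * y for x, y in zip(za, zb)) / len(za)


def gaussian_smooth(values, sigma_trials):
    """Gaussian-weighted average over surrounding trials (edge-normalized)."""
    n = len(values)
    smoothed = []
    for i in range(n):
        weights = [math.exp(-0.5 * ((j - i) / sigma_trials) ** 2) for j in range(n)]
        total = sum(weights)
        smoothed.append(sum(w * v for w, v in zip(weights, values)) / total)
    return smoothed


def split_time_scales(counts, sigma_trials=4.0):
    """Split normalized counts into long-term drift and short-term residual."""
    z = zscore(counts)
    long_term = gaussian_smooth(z, sigma_trials)
    short_term = [a - b for a, b in zip(z, long_term)]
    return long_term, short_term
```

Correlating the short-term residuals of two neurons with `spike_count_correlation` then gives an estimate of the correlation that lives on the time scale of a single trial, separate from slow drifts in excitability.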
6 optimal stochastic search adaptive momentum todd leen oregon graduate institute science technology department computer science engineering portland oregon abstract stochastic optimization algorithms typically learning rate schedules behave asymptotically dynamics leen moody algorithms easy path results squared weight error normality apply approach stochastic gradient algorithms momentum show late times learning governed effective learning rate momentum parameter describe behavior asymptotic weight error give conditions ensure optimal convergence speed finally results develop adaptive form momentum achieves optimal convergence speed independent introduction rate convergence gradient descent algorithms batch stochastic improved including weight update momentum term proportional previous weight update authors give conditions convergence covariance weight vector momentum constant learning rate stochastic algorithms require learning rate decay time order achieve true convergence weight probability square probability leen paper previous work weight space probabilities leen moody leen study convergence stochastic gradient algorithms annealed learning rates momentum approach simple derivations previously results extension stochastic descent momentum specifically show squared weight drops maximal rate effective learning rate greater critical determined hessian results suggest algorithm automatically adjusts momentum coefficient achieve optimal convergence rate algorithm simpler previous approaches estimate curvature directly descent measure statistic directly involved optimization darken moody density evolution asymptotics stochastic optimization algorithms weight attention neighborhood local optimum express dynamics terms weight error simplicity treat continuous time algorithm learning rate time weight update function data algorithm time stochastic gradient algorithms minus gradient instantaneous cost function convergence square characterized average squared norm weight
error trace weight error correlation matrix probability density time leen moody show probability density evolves expansion algorithms executed discrete time continuous time formulations analysis passage discrete continuous time treated ways depending theoretical exposition kushner clark define time functions interpolate discrete time process order establish equivalence asymptotic behavior discrete time stochastic process solutions deterministic differential equation heskes draws results link discrete time random walk trajectories solution continuous time master equation heskes master equation equivalent expansion denotes component vector denotes averaging density inputs differentiating respect time integrating parts obtain equation motion weight error correlation asymptotics weight error correlation convergence understood studying late time behavior update function general nonlinear time evolution correlation matrix coupled higher moments weight error learning rate assumed follow satisfies requirements convergence square local optimum late times density sharply peaked suggests expand power series retain lowest order nontrivial terms leaving hessian average cost function diffusion matrix evaluated local optimum note understanding valid large solution evolution operator assume loss generality coordinates chosen diagonal eigenvalues obtain general density nonzero components basin neglecting purpose calculating moment local density vicinity leen define identify regimes behavior fundamentally drops asymptotically drops asymptotically slowly figure shows results simulations ensemble networks trained prediction simulations input data drawn gaussian variance targets generated noisy teacher neuron targets upper curves plot dotted depict behavior remaining curves solid show behavior left simulation results ensemble one-dimensional algorithms theoretical predictions equation curves correspond bottom minimizing coefficient optimal learning
rate found formalism yields asymptotic normality simply leen conditions optimal convergence weight error correlation related results asymptotic normality previously discussed stochastic approximation literature darken moody goldstein white references present formal structure results relative ease facilitates extension stochastic gradient descent momentum stochastic search constant momentum discrete time algorithm stochastic optimization momentum continuous time interested late time behavior define variable arguments previous sections expand power series retaining lowest order nontrivial terms approximation correlation matrix evolves identity matrix defined evolution operator solution squared norm weight error diagonal elements coordinates diagonal find reduces equation defines regimes interest drops asymptotically drops asymptotically slowly leen form conditions show asymptotics gradient descent momentum governed effective learning rate figure compares simulations predictions simulations performed ensemble networks trained previously additional momentum term form upper curves dotted show behavior solid curves show behavior derivation asymptotic normality proceeds similarly case reader referred leen details algorithms momentum theoretical predictions equation left simulation results ensemble one-dimensional curves correspond bottom adaptive momentum optimal convergence optimal constant momentum parameter obtained minimizing imposing restriction parameter positive result practical general unknown linear networks alternative instantaneous estimate network input time define adaptive momentum parameter algorithm based late time convergence optimally fast alternative route achieving goal momentum term adaptively adjust learning rate proposed algorithm diverges appears converge simulations observed additional study required determine range improving learning iteratively estimates algorithms
estimate adjust darken moody propose measuring auxiliary statistic call drift determine adaptive momentum scheme generalizes multiple dimensions easily algorithm unlike darken scheme involve calculating auxiliary statistic directly involved minimization natural extension dimensions define matrix momentum coefficients identity matrix negative eigenvalues obtain adaptive momentum matrix simulations networks initialized left dashed curves correspond adaptive momentum figure shows adaptive momentum achieves optimal convergence rate independent learning rate late times independent smaller left graph displays simulation results momentum convergence rates depend optimal large initially significant spreading increased convergence rate result lower late times graph shows simulations adaptive momentum initially spreading greater momentum quickly decreases reach smaller addition optimal convergence rate achieved values curves words late times independent adaptive momentum summary dynamics weight space probabilities derive behavior weight error correlation annealed stochastic gradient algorithms momentum late time behavior governed effective learning learning rate squared norm weight error falls results developed form momentum adapts obtain optimal convergence rates independent learning rate parameter acknowledgments work supported grants force office scientific research electric power research institute references relation master equations random solutions journal mathematical physics darken john moody faster stochastic gradient search moody hanson advances neural information processing systems morgan kaufmann publishers mateo goldstein square optimality continuous time procedure technical report dept mathematics university southern california kushner clark stochastic approximation methods unconstrained systems springer-verlag york heskes kappen learning neural networks local minima physical review todd leen john moody weight space probability densities stochastic learning
dynamics equilibria giles hanson cowan advances neural information processing systems morgan kaufmann publishers mateo leen weight space probability densities stochastic learning transients basin times giles hanson cowan advances neural information processing systems morgan kaufmann publishers mateo leen momentum optimal stochastic search mozer smolensky touretzky elman weigend proceedings connectionist models summer school john algorithm momentum proceedings ieee international symposium circuits systems properties momentum algorithm signal processing extension procedure annals mathematical statistics white learning artificial neural networks statistical neural computation
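The paper above states that, at late times, learning with momentum is governed by an effective learning rate. The formula itself was lost from the extracted text, so the sketch below assumes the standard form mu / (1 - beta) for the update v <- beta * v - mu * grad. It checks that form in the simplest possible setting: under a constant gradient, the step size settles to the effective learning rate times the gradient. The function names and the constant-gradient setup are illustrative assumptions, not the paper's annealed stochastic analysis.

```python
def momentum_step(w, v, grad, mu, beta):
    """One gradient step with momentum: v <- beta * v - mu * grad; w <- w + v."""
    v_new = beta * v - mu * grad
    return w + v_new, v_new


def effective_learning_rate(mu, beta):
    """Assumed standard form of the effective learning rate, mu / (1 - beta)."""
    return mu / (1.0 - beta)


def terminal_velocity(grad, mu, beta, steps=500):
    """Iterate the momentum update under a constant gradient, return the final step."""
    w, v = 0.0, 0.0
    for _ in range(steps):
        w, v = momentum_step(w, v, grad, mu, beta)
    return v


# Example: mu = 0.01 and beta = 0.9 behave at late times like plain
# gradient descent with learning rate 0.01 / (1 - 0.9) = 0.1.
final_step = terminal_velocity(grad=1.0, mu=0.01, beta=0.9)
```

The geometric series behind this is v_n = -mu * grad * (1 - beta^n) / (1 - beta), so the transient decays as beta^n and the limit is the effective rate times the gradient.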
4 greens function method fast online learning algorithm recurrent neural networks chen institute advanced computer studies laboratory plasma research university maryland college park abstract learning algorithms recurrent neural networks backpropagation rumelhart werbos forward propagation williams zipser main drawback backpropagation offline backward path time error online requirement practical applications forward propagation algorithm online manner drawback heavy load required update high dimensional matrix operations time step develop fast forward algorithm challenging task paper proposed forward learning algorithm order faster operations time step sensitivity matrix algorithm basic idea integrating high dimensional sensitivity dynamic equation solve forward time greens function avoid redundant computations update weights error corrected numerical classifying state trajectories recurrent network presented faster speed proposed algorithm williams algorithm introduction order deal sequential signals recurrent neural networks forward model issue recurrent networks search efficient online training algorithm error backpropagation hinton originally proposed handle feedforward networks method applied train recurrent networks time sequence mappings multilayer feedforward layer identical weights nature backward path basically offline method pineda generalized recurrent networks hidden neurons interested fixed point type behaviors pearlmutter proposed scheme learn temporal involves equations solved backward time essentially generalized version error backpropagation problem learning target state viable online method date rtrl real time recurrent learning algorithm williams propagates matrix forward time main drawback algorithm high cost computation number operations time step faster operations online algorithm appears desirable toomarian barhen proposed online algorithm derived equations backpropagation approach convert backward path forward path adding delta function source term correct
problem straightforward numerical implementation acknowledged theory result correct mistake defined delta function integration briefly function speaking uous integral depends distribution function uniquely defined deal discontinuity carefully splitting time interval segments find adding delta function source term affect basic property adjoint equation solved backward time recently toomarian barhen modified approach proposed alternative online training algorithm nature result similar presented paper approach straightforward easily implemented numerically proposed algorithm combination back propagation data block size forward propagation blocks online algorithm chen studied problem general approach variational approach constrained optimization problem lagrangian multipliers dynamic equation lagrangian multiplier derived adjoint taking advantage linearity equation online algorithm derived numerical implementation algorithm addressed paper paper present approach problem greens function method advantages method simple mathematical formulation easy numerical implementation numerical trajectory classification presented faster speed proposed algorithm numerical results algorithm greens function approach definition problem fully recurrent network neural activity represented n-dimensional dynamic equations written general order differential equations matrix representing weights adjustable parameters vector representing neuron units clamped external input signals time simple network connected order weights nonlinear function function instance sigmoid function suppose part state neurons measurable part neurons hidden measurable units desired output
error term solve differential equation easily derived taking derivative respect forward algorithm recurrent networks solve equation forward time make weight correction input sequence algorithm developed independently researchers page limitation refer related papers simply call williams zipser algorithm online learning make weight correction error corrected input sequence proof convergence online learning algorithm addressed main drawback forward algorithm requires operations time step update matrix goal greens function approach find online algorithm requires computation load greens function solution analyze computational complexity integrating directly rewrite linear operator defined types redundancy operator depend explicitly means solving repeatedly solve identical differential equation components redundant higher order connection weights redundancy special form neural computations activity function sigmoid function neuron kronecker delta function components order tensor computed repeatedly original forward learning scheme attention green function approach avoid redundancy solving dimensional greens function construct solution product greens function turn reduced product green function operator defined dual time tensor function satisfies equation solution solution original equation constructed source term integral find greens function solution introduce tensor function satisfies homogeneous form solution greens function constructed heaviside function defined easily verify constructed greens function shown correct satisfies substituting obtain solution note formal solution satisfies satisfies required initial condition online weight correction time obtained easily implementation implement numerically introduce auxiliary memories define inverse matrix easy dynamic equation define order tensor satisfies weight correction vector solution linear equation discrete time summarize procedure green
function method simultaneously forward time starting error message generated solve update weights memory size required algorithm simply storing speed algorithm analyzed update operations time step solve update operations time step online updating weights totally operations time step order magnitude faster current forward learning scheme numerical simulation present section numerical examples demonstrate proposed learning algorithm benchmark algorithm class class class phase space trajectories shapes trajectory shown column examples recurrent neural networks trained recognize shapes trajectory trajectory classification problem input data time series greens function method fast online learning algorithm recurrent neural networks dimensional coordinate pairs sampled types trajectories phase space sampling uniformly trajectory equations uniformly distributed random parameter changed trajectories distorted examples class shown neural fully recurrent firstorder network dynamics vectors state input neurons symbol represents concatenation number state input neurons represent normalized vector neural network structure shown check state neurons input state input recurrent neural network trajectory classification recognition trajectory data sequence input neurons state neurons evolve dynamics input series check state neurons classify input trajectory winnertakeall rule training assign desired final output trajectory classes integrate calculated error solve decomposition algorithm finally update weights classification error generated input sequence learning online present compare speeds proposed fast algorithm williams algorithms number iterations compare time results shown table iteration present training patterns class patterns chosen randomly selecting values time ratio indicating green function algorithm order faster issue considered error convergent rate learning rate called algorithms calculate weight correction numerical schemes outcomes result error convergent rates 
slightly learning rate numerical simulations conducted learning results good testing recognition perfect single misclassification found training error convergence rates differ numerical experiments show proposed fast algorithm converges slower chen williams small size neural nets faster large size neural fast algorithm ratio number iterations number iterations number iterations table time seconds comparison implemented workstation learning trajectory classification conclusion greens function develop faster online learning algorithm recurrent neural networks algorithm requires operations time step order faster williams algorithm memory required feature algorithm straightforward formula easily implemented numerically numerical trajectory classification demonstrate speed fast algorithm compared williams algorithm references hinton williams learning internal representations error propagation parallel distributed processing press werbos regression tools prediction analysis behavior sciences thesis harvard university pineda generalization backpropagation recurrent neural networks phys letters pearlmutter learning state space trajectories recurrent neural networks neural computation williams zipser learning algorithm continually running fully recurrent neural networks tech report report ucsd jolla november toomarian barhen gulati application adjoint operators neural learning appl math lett toomarian barhen temporal learning algorithms neural networks advances neural information processing systems lippmann moody touretzky morgan kaufmann schmidhuber learning algorithm fully recurrent networks tech report institut für informatik technische münchen chen fast online learning algorithm recurrent neural networks proceedings joint conference networks seattle washington page june
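The forward algorithm that the Green's function paper above uses as its baseline carries a full sensitivity tensor forward in time together with the network state, and that tensor update is where its per-step cost comes from. The following is a minimal sketch of that recursion. It assumes a discrete-time network x(t+1) = tanh(W x(t) + u(t)) in place of the paper's continuous-time dynamics; the function names `step` and `forward_sensitivities` and the choice of tanh are illustrative and not taken from the paper. It is a sketch of the baseline only, not of the Green's function method.

```python
import math

def step(W, x, u):
    """One update x(t+1) = tanh(W x(t) + u(t)) of a fully recurrent net (assumed form)."""
    n = len(x)
    return [math.tanh(sum(W[i][j] * x[j] for j in range(n)) + u[i])
            for i in range(n)]

def forward_sensitivities(W, inputs, x0):
    """Propagate P[i][k][l] = d x_i / d W[k][l] forward in time with the state."""
    n = len(x0)
    x = list(x0)
    P = [[[0.0] * n for _ in range(n)] for _ in range(n)]
    for u in inputs:
        x_new = step(W, x, u)
        P_new = [[[0.0] * n for _ in range(n)] for _ in range(n)]
        for i in range(n):
            gain = 1.0 - x_new[i] ** 2            # derivative of tanh at unit i
            for k in range(n):
                for l in range(n):
                    # old sensitivities fed back through the weights (index j)
                    acc = sum(W[i][j] * P[j][k][l] for j in range(n))
                    if i == k:
                        acc += x[l]               # direct dependence on W[k][l]
                    P_new[i][k][l] = gain * acc
        x, P = x_new, P_new
    return x, P
```

The four nested index ranges (i, k, l, j) make the n^4 multiply-adds per time step explicit; this is the computational load that the text says the Green's function approach reduces by an order.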
11 linear dimension bounds piecewise polynomial networks peter bartlett department system engineering australian national university canberra australia peter department mathematics technion israel meir department electrical engineering technion israel technion abstract compute upper lower bounds dimension feedforward networks units piecewise polynomial activation functions show number layers fixed dimension grows number parameters network result stands case number layers unbounded case dimension grows motivation dimension important measure complexity class binary valued functions characterizes amount data required learning setting paper establish upper lower bounds dimension specific class multilayered feedforward neural networks class functions computed feedforward neural network weights computational units piecewise polynomial activation function goldberg shown constant koiran sontag demonstrated network lead conclude bounds fact tight constant proof establish lower bound made fact number layers grow practical applications number small constant question remains obtain bound realistic scenario number layers fixed contribution work proof upper lower bounds dimension piecewise polynomial nets upper bound behaves number layers fixed superior previous result behaves ideas derive lower bound dimension maass shows threelayer networks threshold activation functions binary inputs dimension sakurai shows true twolayer networks threshold activation functions real inputs easy show results imply similar lower bounds threshold activation function replaced piecewise polynomial activation function bounded distinct limits conclude number layers fixed dimension piecewise polynomial networks layers real inputs piecewise polynomial networks layers binary inputs grows note piecewise polynomial networks considered work easy show dimension closely related similar bounds constants hold independently sakurai obtained similar upper
bounds improved lower bounds dimension piecewise polynomial networks upper bounds begin technical discussion precise definitions vcdimension class networks considered work definition system subsets shattered subset exists vcdimension denoted largest integer exists cardinality shattered intuitively dimension measures size largest points labelings achieved sets convenient talk dimension classes indicator functions case simply identify sets points subsets notation feedforward multilayer network directed acyclic graph represents parametrized realvalued function real inputs node called input unit computation unit computation units arranged layers edges allowed input units computation units edge computation unit computation unit unit lower layer single unit final layer called output unit input unit real components input vector computation unit real called units output edge real parameter computation unit output computation unit ranges edges leading bartlett meir unit parameter weight edge output unit edge emerges parameter bias unit called activation function unit argument called input unit suppose unit output unit activation function fixed piecewise polynomial function form polynomial degree degree activation function output unit identity function denote number computational units layer suppose total param eters weights biases computational units input parameter vector denote output network denote class functions computed architecture vary parameters computation dimension class functions giving main theorem section present result slight improvement result chapter lemma suppose fixed polynomials degree variables number distinct sign vectors generated varying main result theorem positive integers network real inputs parameters computational units arranged layers single output unit identity activation function computation units piecewise polynomial activation functions degree class realvalued functions computed network fixed implies presenting proof outline main idea 
construction fixed input output network corresponds piecewise polynomial function parameters degree larger recall layer linear parameter domain split regions function polynomial lemma obtain upper bound number sign assignments attained varying parameters polynomials theorem established combining bound bound number regions proof theorem arbitrary choice points bound linear dimension bounds piecewise polynomial networks points partition parameter domain choose partition region fixed polynomials degree lemma term remaining point construct partition determine upper bound size partition constructed recursively procedure partition constants break points piecewise polynomial activation functions affine func tion describing input unit layer response weights unit layer note partition determined solely parameters hidden layer input layer unaffected parameters output layer unit response fixed polynomial number variables computing unit outputs layer number computation units layer recall choose number sign assignments arline functions variables lemma shows define assume input unit layer response fixed polynomial function degree partition refinement constants polynomial function describing input unit layer response implies output layer unit response fixed polynomial degree finally choose number sign assignments polynomials variables degree lemma notice input unit layer bartlett meir response fixed polynomial function degree proceeding partition network output response fixed polynomial degree multiplying bound result points chosen arbitrarily bound number dichotomies induced points upper bound vcdimension obtained computing largest number yielding base conclude lemma briefly mention application result problem learning gression function inputoutput pairs drawn independently random unknown distribution case quadratic loss show exist constants noise variance approximation error function class approxi minimizes sample average quadratic loss making recently derived bounds 
approximation error equal logarithmic factors obtained networks units stan dard sigmoidal function combining considerably lower bounds piecewise polynomial networks obtain error rates sigmoid networks lower bound compute lower bound dimension neural networks continuous activation functions result generalizes lower bound holds number layers linear dimension bounds piecewise polynomial networks theorem suppose properties differentiable point derivative feedforward network properties network layers parameters output unit linear unit computation units activation function functions computed network largest integer equal proof proof theorem show functions computed network track number parameters layers required prove lower bound network linear threshold units linear units identity activation function show output unit replaced units activation function resulting network details proof full paper positive integers construct points shattered network weights layers denote parameters binary representation base representation inputs bits defined similarly show extract bits input network outputs inputs form values result follow stages computation computing extracting selecting suppose network input linear unit compute involves parameters computation unit layer fact parameters extra parameter show linear unit replaced unit activation function parameter recursion initial conditions compute layers parameters computational units compute approach fewer layers input vector implying bartlett meir order conclude proof show variables recovered depending inputs boolean computation involves additional parameters computational units adds layers total layers parameters network size parameters layers affecting function network provided case vcdimension network constructed linear threshold units linear units easy show theorem unit output unit replaced unit activation function network size linear units input output weights scaled linear function approximated accuracy neighborhood point linear 
threshold units input weights scaled behavior infinity accurately approximates linear threshold function references anthony bartlett neural network learning theoretical foundations cambridge university press blumer ehrenfeucht haussler warmuth learnability vapnikchervonenkis dimension bartlett meir linear vcdimension bounds piecewise polynomial networks neural computation goldberg bounding dimension concept classes parameterized real numbers machine learning koiran sontag neural networks quadratic journal computer system science maass neural nets vcdimension neural computation meir optimality stochastic approximation smooth functions neural networks submitted publication sakurai tighter bounds vcdimension threelayer networks world congress neural networks volume pages hillsdale erlbaum sakurai tight bounds vcdimension piecewise networks advances neural information processing systems volume press vapnik estimation dependences based empirical data springerverlag york theory learning generalization springerverlag york
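The definition of shattering that the paper above builds on (a set of points is shattered when every labeling of it is achieved by some function in the class) can be checked by brute force on small cases. The sketch below does this for a toy class, indicator functions of closed intervals on the real line with endpoints on a finite grid. The interval class, the grid, and the names `dichotomies` and `is_shattered` are illustrative assumptions; they stand in for the piecewise polynomial networks of the paper, which are far too large to enumerate this way.

```python
def dichotomies(points, functions):
    """Distinct 0/1 labelings that the function class induces on the points."""
    return {tuple(f(x) for x in points) for f in functions}

def is_shattered(points, functions):
    """True when all 2^m labelings of the m points are achieved."""
    return len(dichotomies(points, functions)) == 2 ** len(points)

# Toy class: indicators of closed intervals [a, b] with endpoints on a grid.
grid = [0.5, 1.5, 2.5, 3.5]
intervals = [lambda x, a=a, b=b: int(a <= x <= b)
             for a in grid for b in grid if a <= b]

two_points = [1.0, 2.0]
three_points = [1.0, 2.0, 3.0]
print(is_shattered(two_points, intervals))     # all four labelings occur
print(is_shattered(three_points, intervals))   # labeling (1, 0, 1) is missing
```

Counting the labelings directly mirrors the proof strategy in the text, where the number of dichotomies on m arbitrary points is bounded and compared with 2^m.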
7 neural model schizophrenia ruppin james department computer science university maryland college park david horn school physics aviv university aviv israel abstract implement study computational model stevens theory schizophrenia theory onset schizophrenia reactive synaptic regeneration occurring brain regions receiving temporal lobe projections concentrating area frontal cortex model frontal module associative memory neural network input synapses represent incoming temporal projections analyze face weakened external input projections compensatory internal synaptic connections increased noise levels maintain capacities generally preserved schizophrenia compensatory lead biased retrieval stored memories corresponds occurrence apparent external trigger tendency central results explain tend schizophrenia progresses intervention leads slower response ruppin james david horn introduction growing interest recent years neural models brain cognitive behavioral effects recent published examples studies include models cortical plasticity stroke disease schizophrenia cognitive behavioral acquired disorders reviewed continuing line study present computational account linking specific pathological synaptic postulated occur schizophrenia emergence denote persistent unrealistic percepts times patient manner wealth data gathered schizophrenia frontal temporal hand hippocampus areas including neuronal loss hand metric studies expansion receptor binding sites increased dendritic branching frontal cortex stevens recently sented theory linking temporal frontal findings onset schizophrenia reactive anomalous synaptic organization taking place projection sites temporal neurons including cortical subcortical structures frontal paper presents computational study stevens theory frame work memory model interaction show introduction microscopic synaptic underlie stevens preserve memory function results specific pathological macroscopic behavior network small subset patterns stored network 
spontaneously retrieved times specific input pattern emergent behavior shares important char frequently absence apparent external trigger tend concentrate limited recurrent memory capacities fairly preserved late stages disease section present model analytical numerical results obtained section conclusions section model illustrated figure model frontal module associative memory attractor neural network receiving input memory cues decaying input fibers representing temporal projections works internal connections store memorized patterns undergo synaptic model reactive synaptic regeneration frontal module effect diffuse external projections modeled back ground noise frontal module represents unit suggested basic functional building block neocortex assumption memory retrieval frontal cortex invoked firing incoming neural model schizophrenia temporal projections based notion temporal structures portant role establishing longterm memory neocortex retrieval facts events cortical noise figure schematic illustration model frontal module modeled attractor neural network neurons receive inputs kinds connections internal connections frontal neurons external connections temporal lobe neurons diffuse external connections cortical modules modeled noise attractor network variant hopfields model proposed tsodyks neuron binary variable denoting active firing passive quiescent state distributed memory patterns stored network elements memory pattern chosen probability neurons network fixed uniform threshold initial state weights internal synaptic connections postsynaptic potential input field neuron internal contributions neurons external projections updating rule neuron time prob ruppin james david horn sigmoid function denotes noise level activation level stored memories measured overlaps current state network defined retrieval modeled field memorized patterns pattern presentation external input network state evolves converges stable state network parameters tuned initial state 
correctly patterns examine networks behavior absence specific stimulus network continue state random baseline activity converge stored memory state refer process spontaneous retrieval investigation stevens work proceeds stages examine behavior network undergoes uniform synaptic represent pathological occurring accordance stevens theory include external input projections increase internal projections noise levels stage assumption internal synaptic compensatory additional hebbian activitydependent component examine effect rule neuron firing quiescent iterations constant results show simulation analytic results examining effects pathological taking place accordance stevens theory macroscopic behavior network analytical results presented derived calculating magnitude randomly formed initial biases comparing effect networks dynamics versus effect externally sented input cues comparison performed formulating overlap master equation fixed point dynamics investigated phase plane analysis study reactive synaptic occurring internal external diffuse synapses extent maintaining memory capacities face external input synapses illustrated figure find increased noise levels degree preserve memory retrieval face decreased external input strength increased synaptic preserves neural model schizophrenia figure retrieval performance measured average final overlap function noise level curve displays relation magnitude external input projections simulation results analytic approximation memory retrieval similar manner combined effect synaptic compensatory measures compensatory synaptic maintain memory capacities necessarily effects leading eventually emergence spontaneous activation memory patterns network converges memory patterns pathological autonomous manner absence external input stimuli emergence pathological retrieval noise level internal synaptic strength increased point demonstrated figure compensatory regeneration internal synapses additional hebbian component representing 
period increased activitydependent plastic synaptic biased spontaneous retrieval tribution obtained time evolves measured time units trials distribution patterns spontaneously retrieved network ical manner concentrate memory patterns stored network shown figure highly peaked distribution maintained hundred additional trials memory retrieval sharply global attractor formed mixed attractor state high overlap memorized pattern represent welldefined cognitive perceptual item state hebbian activitydependent evolution network activity dependent spontaneous activity emerge distribution retrieved memories remains homogeneous figure eventually global ruppin james david horn figure spontaneous retrieval measured highest final overlap achieved stored memory patterns displayed function noise level spontaneous retrieval function internal synaptic compensation factor attractor formed network retrieval capacities process memory pattern dominate retrieval output results remain qualitatively similar bounds absolute magnitude synaptic weights conclusions results suggest formation biased spontaneous retrieval requires occurrence external input fibers hebbian synaptic connections support plausibility stevens theory showing real ized neural model account characteristics emergence spontaneous retrieval phenomenon eventually meaningless global attractor formed clinical finding schizophrenia progresses tend negative enhanced converged network larger tendency remain biased memory state biased accordance persistent characteristic spontaneous retrieval trials occur frequency spontaneous retrieval increases early treatment young neural model schizophrenia trials trials trials memories trials trials figure distribution memory patterns spontaneously retrieved axis memories stored yaxis denotes retrieval frequency memory distribution retrieval memories leads early response days late delayed intervention leads slower response months current model generates testable predictions level model tested 
quantitatively searching positive correlation recent history findings synaptic compensation kind correlation indices synaptic area cognitive functioning found patients physiological level increased compensatory noise increased spontaneous neural activity prediction difficult examine directly studies show significant increase delta activity reflect increased spontaneous activity clinical level formation large deep basin attraction memory pattern focus spontaneous retrieval proposed model predicts retrieval tion frequently triggered environmental cues recent study points direction ruppin james david horn acknowledgement research supported fellowship ruppin references connectionist models handbook volume press ruppin neural modeling disorders network computation neural systems review paper stevens abnormal basis schizophrenia recurrence consecutive episodes schizophrenia disorder schizophrenia research schizophrenia brain england journal medicine columnar distribution corticocortical fibers frontal association motor cortex developing monkey brain memory hippocampus synthesis findings rats monkeys humans psychological review tsodyks enhanced storage capacity neural networks activity level europhys lett horn ruppin synaptic compensation attractor neural networks modeling findings schizophrenia neural computation page carpenter schizophrenia england journal medicine schizophrenia brain disease dopamine receptor story neurol synapse loss frontal cortex disease correlation cognitive neurology abnormal responses stimulation patients schizophrenia study preliminary findings david editors schizophrenia erlbaum
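The attractor network described in the model section of the paper above (binary 0/1 neurons, low-activity memory patterns, a uniform threshold, a sigmoid update with a noise level, and retrieval measured by overlaps) can be sketched as follows. The specific formulas are assumptions, since they did not survive in the text: the weights are taken as (1/N) times the sum over patterns of (xi_i - p)(xi_j - p), with self-coupling kept for brevity, the threshold as p(1-p)(1-2p)/2, and the overlap normalized by p(1-p)N, which is the usual low-activity formulation. The names `overlaps` and `update` and the optional `ext` input field are hypothetical.

```python
import math
import random

def overlaps(patterns, state, p):
    """Overlap of the current state with each stored pattern (assumed normalization)."""
    norm = p * (1.0 - p) * len(state)
    return [sum((xi - p) * s for xi, s in zip(pat, state)) / norm
            for pat in patterns]

def update(patterns, state, p, theta, noise, rng, ext=None):
    """One synchronous update; noise == 0 gives a deterministic threshold rule."""
    m = overlaps(patterns, state, p)
    new_state = []
    for i in range(len(state)):
        # internal field: equals sum_j W_ij S_j for the assumed Hebbian weights
        h = p * (1.0 - p) * sum((pat[i] - p) * m_mu
                                for pat, m_mu in zip(patterns, m))
        if ext is not None:
            h += ext[i]                     # external (temporal) input projection
        if noise == 0.0:
            fire = h > theta
        else:
            z = max(-50.0, min(50.0, (h - theta) / noise))
            fire = rng.random() < 1.0 / (1.0 + math.exp(-z))
        new_state.append(1 if fire else 0)
    return new_state
```

Starting from a corrupted copy of a stored pattern and iterating `update` with zero noise should drive the corresponding overlap toward 1; weakening `ext` while raising `noise` or scaling the internal field is the kind of manipulation the paper studies.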
9 dynamically adaptable cmos winnertakeall neural network information technology research laboratories sharp japan abstract major problem prevented practical application analog poor accuracy fluctuating analog device characteristics inherent device result paper proposes dynamic control architecture analog silicon neural networks compensate fluctuating device characteristics adapt change input level applied architecture compensate input offset voltages analog cmos winnertakeall chip fabricated experimental data show effectiveness architecture introduction analog vlsi implementation neural networks silicon retinas adaptive filters focus active research utilizes physical laws electric devices obey neural operation circuit scale smaller digital counterpart massively parallel implementation major problem prevented practical applications fluctuating analog device characteristics inherent device result main reason analog devices digital devices analog neuro vlsi expected problem making view fact spite components biological neural networks show excellent competence paper proposes cmos circuit architecture dynamically fluctuating component characteristics time adapts device state incoming signal levels engineering techniques compensate threshold fluctuation comparator change mode achieve desired effect modes adaptation signal processing extra clock signals needed break signal processing takes place incoming signals consist rapidly changing foreground component slowly varying background component process signals biological neural networks make multiple channels scales channel suppress background floating channel devoted process foreground signal proposed method inspired biological consideration utilizes frequency bands adaptation signal processing figure negative feedback applied pass filter feedback affect foreground signal processing pass filter input signal processing output signal pass background band foreground band frequency figure dynamic adaptation frequency divided 
control model diagram frequency division part paper working analog cmos chip test fabricated introduced dynamical adaptation chip experimental results presented analog cmos chip architecture specification layer layer feedback controller figure analog cmos chip architecture dynamically adaptable cmos winnertakeall neural network figure circuit diagrams competitive cell feedback controller basic building block construct analog circuits investigated researchers lazzaro cmos analog circuits based voltage circuits realize competition inhibitory interaction feedback mechanisms enhance resolution gain architecture chip fabricated shown figure circuit diagram figure chip lowest input voltage making output voltage corresponds lowest input voltage winner power supply voltage circuit similar represents advances steering current feedback controller line allowing winner cell compete region resolution gain largest feedback controller originally competitive layer removed order guarantee existence output node voltage table shows specifications fabricated chip table specifications fabricated chip process number input nodes power dissipation measured power supply resolution theoretical settling time measured area cmos input offset voltage input offset voltages chip greatly chip performance examples input offset voltage distribution fabricated chips shown figure input offset voltage measured relative input node input offset voltage input node defined voltages output nodes equal fixed voltage voltage input nodes fixed high voltage figure examples measured input offset voltage distribution primary factor input offset voltage considered fluctuation transistor threshold voltages layer competitive cell input offset voltage cell yielded small fluctuation calculated transconductance drain conductance design process parameters estimate input offset voltage based experiences maximum fluctuation chip smaller reasonable difference smaller compose current mirror closely implies maximum rough 
agreement measured data dynamical adaptation architecture figure show circuit implementation dynamically adaptable function feedback channel difference output reference back input node pass filter consisting charge stored capacitor controlled feedback signal linear approximation chip characteristic voltages nodes functions input offset voltage relative node considered difference hand characteristic feedback path approximated dynamically adaptable cmos winnertakeall neural network chip figure chip equipped adaptation circuit equations term derived assumptions means voltage difference level input clamped capacitor turn implies input offset voltage successfully compensated role pass filters twofold guarantee stable dynamics feedback loop make cutoff frequency pass filters small gain feedback path phase feedback signal delayed prevent feedforward operation affected shown figure adaptive control carried frequency band operation experimental results experiments adaptable function carried applying pulses input nodes input nodes fixed voltage figures output waveforms waveform pulse applied node shown figure shows result pulse applied figure shows result amplitude pulse greater pulse schematic explanation behavior figure outputs remained levels inputs strong result adaptation winning frequencies output nodes equal long time scale explains unstable output period quiescent inputs chip measurement relative input offset voltage nodes figure offset voltage completely compensated output waveforms nodes figure output waveforms dynamically adaptable cmos neural network pulse waves applied nodes nodes voltages fixed amplitude pulse output waveforms amplitude pulse greater output voltage winner high period pulse inputs outputs quiescent winner hysteresis unstable figure schematic explanation dynamically adaptable behavior conclusion proposed dynamic adaptation architecture frequency divided control applied cmos chip fabricated experimental results show architecture successfully 
compensated input offset voltages chip inherent device characteristic fluctuations architecture analog ability adapt incoming signal background levels applications vision chips adaptation compensate fluctuation sensor characteristics adapt gain sensors background illumination level automatically control color balance application figure describes analog neuron weighted synapses time constant larger time constant input signals inputs output figure analog neuron weighted synapses time constant larger input signals architecture nonoverlapping frequency bands adaptation background foreground signal processing requires implementing circuits completely time scale constants modern vlsi technology difficult problem processes high resistances acknowledgment authors chip fabrication support experimental work references vlsi winnertakeall circuit organizing neural networks ieee solidstate circuits lazzaro mahowald mead winnertakeall networks complexity touretzky advances neural information processing systems cambridge press neural voltage comparator network electron lett inhibitory mechanism analysis complexity winnertakeall networks ieee trans circuits syst
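The frequency-divided adaptation idea in the paper above (slow negative feedback cancels input offsets and the background level, while fast foreground signals pass through to the competition) can be illustrated with a very simple discrete-time model. This is a sketch under strong assumptions, not a model of the fabricated circuit: each node is treated as linear, the feedback is a first-order integrator with a hypothetical per-step gain `alpha`, and, following the text, the winner is the node with the lowest effective input. The function name `simulate` and all numeric values are illustrative.

```python
def simulate(offsets, signals, alpha=0.01, ref=0.0):
    """Return the winner index at every time step and the final compensation.

    offsets : fixed per-node input offset voltages (unknown to the circuit)
    signals : list of per-step input voltage vectors
    alpha   : feedback gain per step; a small value means a low cutoff frequency
    ref     : reference level the feedback drives each effective input toward
    """
    n = len(offsets)
    comp = [0.0] * n                       # voltage held on each feedback capacitor
    winners = []
    for v in signals:
        eff = [v[i] + offsets[i] - comp[i] for i in range(n)]
        winners.append(min(range(n), key=lambda i: eff[i]))
        # slow negative feedback: integrate the deviation from the reference
        comp = [comp[i] + alpha * (eff[i] - ref) for i in range(n)]
    return winners, comp

offsets = [0.05, -0.03, 0.02]
quiet = [[1.0, 1.0, 1.0]] * 2000           # slowly varying background only
pulse = [[0.99, 1.0, 1.0]] * 5             # brief pulse pulls node 0 down by 10 mV
adapted, _ = simulate(offsets, quiet + pulse, alpha=0.01)
fixed, _ = simulate(offsets, quiet + pulse, alpha=0.0)
print(adapted[-1], fixed[-1])
```

With `alpha=0.0` the node with the most negative offset wins regardless of the pulse, whereas with slow feedback the offsets are absorbed during the quiet period and the pulsed node wins. During the quiet period the adapted effective inputs are nearly equal, which is consistent with the unstable outputs the text reports for quiescent inputs.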
5 learning control extreme uncertainty gullapalli computer science department university massachusetts amherst abstract insertion task illustrate utility direct associative reinforcement learning methods learning control realworld conditions uncertainty noise task complexity hole presence positional uncertainty magnitude exceeding times extreme degree uncertainty results direct reinforcement learning learn robust reactive control strategy results insertions introduction control tasks interest today involve controlling complex nonlinear systems uncertainty noise traditional control design techniques effective circumstances methods learning control increasingly popular control tasks difficult obtain training information form prespecified instructions perform task supervised learning methods directly applicable time evaluating performance controller task fairly straightforward tasks ideally suited application associative learning barto anandan purposes noise regarded simply sources uncertainty gullapalli associative reinforcement learning learning systems interactions environment evaluated critic goal learning system learn respond input action expected tion learning control tasks learning system controller actions control signals evaluations based performance criterion control task kinds associative reinforcement learning methods direct indirect distinguished gullapalli indi reinforcement learning methods construct model environment critic modeled separately direct reinforcement learning methods previously argued gullapalli barto gullapalli presence uncertainty learning adequate indirect methods training difficult direct reinforcement learning methods situations paper insertion task illustrate utility direct associative reinforcement learning methods learning control realworld conditions uncertainty insertion insertion widely testing proaches robot control studied canonical robot assembly operation gordon abstract task solved easily realworld conditions uncertainty 
errors noise sensory feedback errors execution motion uncertainty movement part grasped robot substantially degrade performance traditional control methods approaches proposed insertion uncertainty grouped major classes methods based offline planning methods based reactive control offline planning methods combine geometric analysis configuration analysis task determine motion strategies result successful insertion gordon ence uncertainty sensing control researchers suggested incorporating uncertainty geometric model task configuration space line planning based assumption realistic characterization margins uncertainty strong assumption dealing realworld systems methods based reactive control comparison counter effects uncertainty online modification motion control based sensory feed back motion control trajectory modified contact forces tactile stimuli occurring motion behavior actively generated occurs physical char robot points tasks including insertion task require complex nonlin behavior capability passive mechanism humans find difficult learning control extreme uncertainty behavior presence uncertainty techniques learning behavior demonstrate approach learning reactive control strategy insertion training controller perform insertions robot equipped joint position encoders axis force sensor outputs subject uncertainty describing controller presenting performance insertion present experimental data quantifying uncertainty position force sensors quantifying sensor uncertainty order quantify position uncertainty varying load conditions similar occur interacting hole compared sensed position actual position cartesian space load conditions experiment robot maintain fixed position loads conditions applied sequentially load fixed load applied directions condition position force feedback robot sensors actual position recorded sensed actual positions shown table sensed positions computed joint positions sensed zeros joint position encoders table large discrepancy sensed 
actual positions actual change position external load order largest sensed change position comparison hole task observations robot determine uncertainty position primarily factors affecting uncertainty include robot affects loaded interactions environment table sensed actual positions load conditions load condition sensed position actual position load position load load load final load position figure shows timestep samples force sensor output load conditions figure considerable sensor noise recording moments designing controller robustly perform insertions large uncertainty sensory input gullapalli difficult results controller learn robust insertion strategy figure timestep samples sensed forces moments load conditions ideal sensor constant timestep interval learning insertion approach learning reactive control strategy insertion certainty based active generation behavior nonlinear mapping sensed positions forces position commands controller learns mapping repeated attempts insertion insertion tasks depicted figure versions insertion task attempted version task long wide hole wide hole version long diameter hole diameter case controller implemented connectionist network operated closed loop robot learn reactive trol strategy performing insertions network task inputs sensed positions forces gullapalli learning control extreme uncertainty insertion task insertion task forces moments controls commands figure insertion tasks outputs forming position command hidden layers units task network inputs sensed positions forces outputs forming position command hidden layers units networks hidden units backpropagation units output units stochastic realvalued reinforcement learning units gullapalli units direct reinforcement learning algorithm find realvalued output input gullapalli details position inputs network computed sensed joint positions forward kinematics equations force moment inputs sensed force sensor loop robot position output network time step training methodology 
controller network trained sequence trials started random position orientation respect hole successfully inserted hole time steps insertion termed successful inserted depth hole time step training sensed position forces input network computed control output executed robot resulting motion evaluation performance ranging denoting evaluation computed based position forces acting forces denotes largest magnitude force component closer sensed position desired position inserted hole higher evaluation large sensed forces reduced evaluation evaluation network adjusted weights appropriately cycle repeated gullapalli performance results learning curve showing final evaluation consecutive trials task shown figure final evaluation levels close figure smoothed final evaluation received smoothed insertion time simulation time steps consecutive trials insertion task smoothed curve obtained filtering data window consecutive values trials amount training controller consistently perform successful insertions time steps performance measured insertion time continues improve learning curve figure shows time insertion decreasing continuously trials curves controller progressively insertion training similar results obtained task learning slower case performance curves task shown figure discussion conclusions task ance high degree uncertainty sensory feedback fine motion control requirements insertion make consideration learning control extreme uncertainty positional uncertainty order times clear hole primarily significant uncertainty sensed forces moments sensor noise results direct reinforcement learning learn control strategy works robustly presence high degree uncertainty learning control extreme uncertainty figure smoothed final evaluation received smoothed insertion time simulation time steps consecutive insertion task smoothed curve obtained filtering data window consecutive values studied similar tasks work learning hole insertion assumed positional uncertainty order magnitude 
results presented simulated systems results approach works physical system higher magnitudes noise greater degree uncertainty inherent dealing physical systems success direct reinforcement learning approach training controller approach automatically robot control strategies satisfy constraints encoded performance evaluations acknowledgements paper discussions andrew barto running based work supported force office scientific research grant national science foundation grant references teaching learning neural nets representa tion generation ieee international conference robotics automation pages gullapalli barto anandan pattern recognizing stochastic learning ieee transactions systems cybernetics barto gullapalli neural networks adaptive control arbib editors natural intelligence research notes neural computation springerverlag washington press assembly strategies parts proceedings ieee international conference robotics automation pages robot motion planning uncertainty geometric models robot environment formal framework error detection recovery proceedings ieee international conference robotics automation pages fine motion planning uncertainty international journal robotics research gordon automated assembly feature localization thesis massachusetts institute technology laboratory cambridge technical report gullapalli stochastic reinforcement learning algorithm learning real valued functions neural networks gullapalli reinforcement learning application control thesis university massachusetts amherst gullapalli barto learning reactive control proceedings ieee international conference robotics automation pages nice france theory threedimensional parts journal mechanisms automated sign december learning expert systems robot fine motion control editors proceedings ieee international symposium intelligent control pages ieee computer society press washington mason taylor automatic synthesis fine motion strategies robots international journal robotics research spring 
assembly supported rigid parts journal dynamic systems measurement control march robot motion planning control press cambridge
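The learning curves reported above are described as smoothed by filtering the per-trial data over a window of consecutive values. A minimal sketch of such a filter follows, assuming a trailing moving average. The window length does not survive in the text, so `window` is left as a free parameter, and the function name `moving_average` is hypothetical rather than taken from the paper.

```python
from collections import deque


def moving_average(values, window):
    """Smooth a sequence with a trailing moving-average filter.

    Assumption: the curve is averaged over the most recent `window`
    values; the first few outputs average over fewer points.
    """
    if window < 1:
        raise ValueError("window must be at least 1")
    smoothed = []
    recent = deque()      # values currently inside the window
    running_sum = 0.0
    for v in values:
        recent.append(v)
        running_sum += v
        if len(recent) > window:
            running_sum -= recent.popleft()
        smoothed.append(running_sum / len(recent))
    return smoothed
```

Applied to the sequence of final evaluations or insertion times per trial, this yields one smoothed value per trial; a centered window would serve equally well if no lag is wanted.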
11 learning restricted training sets exact solution benchmark general theories sollich department mathematics kings college london london coolen abstract solve dynamics online hebbian learning perceptrons regime size training scales linearly number inputs noiseless noisy teachers calculation extended hebbian rules solution nice benchmark test general advanced theories solving dynamics learning restricted training sets introduction considerable progress made understanding dynamics supervised learning layered neural networks application methods mechanics recent review work field contained part theories concentrated systems training larger number updates circumstances probability question repeated training process negligible assume large networks central limit theorem local field tribution gaussian paper restricted training sets suppose size training scales linearly number inputs probability question training process longer negligible assumption local fields gaussian distributions clear correlations develop weights learning restricted training sets exact solution questions training training progresses fact nongaussian char local fields prediction satisfactory theory learning restricted training sets numerical simulations authors discussed learning restricted training sets general theory difficult simple model learning restricted training sets solved attractive difficult sophisticated general theories tested compared show accomplished online hebbian learning perceptrons restricted training sets tain exact solutions generalisation error training error class noisy teachers students arbitrary weight decay theory excellent agreement numerical simulations prediction probability density student field striking making clear dealing local fields nongaussian definitions study online learning student perceptron perform task defined teacher perceptron fixed weight vector assume teacher noisy actual teacher output student response vector drawn independently probability depend 
explicitly correct teacher vector interest choices literature output noise gaussian input noise represents probability teacher output incorrect variance chosen achieve scaling learning rule online hebbian rule nonnegative parameters rare decay rate learning rate iteration step input vector picked random training consisting randomly drawn vectors remains unchanged learning dynamics time teacher selects random independently vector probability distribution iterating equation assume noisy teacher output consistent sense question stage training process teacher makes choice cases consistency define generalised training including sollich coolen questions teacher vectors sources randomness problem random path simply randomness stochastic process evolution vector averages process denoted randomness composition training write averages training sets sets note averages training choose time unit finally assume statistically independent training vectors obey explicit microscopic expressions stage learning process simple scalar observables joint distribution fields calculated questions training infinitely large systems prove fluctuations meanfield randomness dynamics vanish assumes support numerical simulations evolution observables observed random tions training fluctuations vanish called selfaveraging properties central current theories introduction averages observables respect dynamical randomness respect randomness training carried precisely order lira lira fundamental calculations average calculated find wide class learning restricted training sets exact solution output noise gaussian input noise averages simple scalar observables calculation execute path average average sets straightforward albeit find equations examples output noise gaussian input noise note generalisation error models teacher noise generalisation error time true output noise gaussian input noise respective parameters related type teacher noise holds associate effective output noise parameter note 
effective teacher error probability general identical true teacher error probability immediately calculating gaussian input noise average joint field distribution calculation average joint field distribution starting equation difficult writing expressing functions terms complex find sets expression replace writing product terms auxiliary variables find large random variables sollich coolen study statistics shows gaussian random variable equal variance unity basis results equations find expressions note independent distribution assume independent condition reflects sense property teacher noise preserves perceptron structure satisfied models true reasonable noise models joint probability density form equation leads expression conditional probability observe probability distribution models dependence directly observable quantity training error student field probability density note dependence specific noise model arises solely find output noise gaussian input noise models order numerical computation remaining integrals reduce number integrations analytically details reported comparison numerical simulations clear large number parameters vary order generate simulation experiments test theory restrict presenting number representative results figure shows output noise model probability density learning restricted training sets exact solution figure student field distribution case output noise times left histograms distributions measured simulations lines theoretical predictions student field develops time starting gaussian evolving highly nongaussian distribution double peak time theoretical results give extremely satisfactory account numerical simulations figure compares predictions generalisation training errors results numerical simulations initial conditions choices important parameters controls amount teacher noise measures relative size training theoretical results excellent agreement simulations system found memory past learning rules asymptotic values independent 
initial student vector examples consistently larger difference pronounced increases note circumstances larger careful inspection shows hebbian learning true overfitting effects case large small large amounts teacher noise regularisation weight decay minor finite time minima generalisation error found short times combination special choices parameters initial conditions discussion starting microscopic description hebbian online learning perceptrons restricted training sets size number inputs developed exact theory terms macroscopic observables enabled predict generalisation error training error probability density student local fields limit results agreement numerical simulations carried systems size case output noise predictions gaussian input noise model compared results simulations calculations scenarios involving instance timedependent learning rates timedependent decay rates straightforward clear present calculations extended rules sollich coolen figure generalisation errors training errors observed online hebbian learning functions time upper graphs upper left upper lower graphs lower left lower markers simulation results system solid lines predictions theory cases ultimately rely ability write microscopic weight vector time explicit form provide significant sophisticated general theories tested played valuable role assessing conditions recent general theory learning restricted training sets based dynamical version replica formalism exact references coolen statistics computing krogh hertz phys math sollich barber europhys lett sollich barber advances neural information processing systems jordan kearns solla cambridge coolen saad kings college london preprint coolen saad preparation
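The simulation setting described above (online Hebbian learning with weight decay, a fixed training set whose size scales with the number of inputs, and a consistent output-noise teacher) can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' code: the update is assumed to take the form J <- (1 - gamma/N) J + (eta/N) xi T, inputs are assumed binary, time is counted in units of N updates, and the training error is measured against the clean teacher sign. The names `simulate_hebbian`, `flip_prob` and the default values are hypothetical.

```python
import math
import random


def sign(x):
    return 1.0 if x >= 0 else -1.0


def simulate_hebbian(n_inputs=200, alpha=1.0, eta=1.0, gamma=0.5,
                     flip_prob=0.1, t_max=5.0, seed=0):
    """Return (generalisation error, training error) after time t_max."""
    rng = random.Random(seed)
    n = n_inputs
    p = int(alpha * n)                      # training set size scales with N

    # Teacher weight vector, normalised to unit length.
    b = [rng.gauss(0.0, 1.0) for _ in range(n)]
    norm = math.sqrt(sum(x * x for x in b))
    b = [x / norm for x in b]

    # Fixed training set; noisy labels are drawn once ("consistent" teacher).
    xi = [[rng.choice((-1.0, 1.0)) for _ in range(n)] for _ in range(p)]
    clean = [sign(sum(bi * x for bi, x in zip(b, v))) for v in xi]
    labels = [-c if rng.random() < flip_prob else c for c in clean]

    j = [0.0] * n                           # student starts at the origin
    for _ in range(int(t_max * n)):         # one time unit = N updates
        mu = rng.randrange(p)               # question picked at random
        v, t = xi[mu], labels[mu]
        j = [(1.0 - gamma / n) * ji + (eta / n) * x * t
             for ji, x in zip(j, v)]

    r = sum(ji * bi for ji, bi in zip(j, b))    # student-teacher overlap
    q = sum(ji * ji for ji in j)                # squared student length
    cos_angle = max(-1.0, min(1.0, r / math.sqrt(q)))
    e_gen = math.acos(cos_angle) / math.pi
    e_train = sum(
        sign(sum(ji * x for ji, x in zip(j, v))) != c
        for v, c in zip(xi, clean)
    ) / p
    return e_gen, e_train
```

Averaging the returned errors over several seeds, for a range of `alpha` and `flip_prob`, gives curves of the kind the theory is compared against.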
8 contextdependent classes hybrid recurrent speech recognition system tony robinson mike hochberg cambridge university engineering department street cambridge england email abstract method incorporating contextdependent phone classes hybrid speech recognition system intro duced modular approach adopted singlelayer networks discriminate context classes phone class acoustic data context networks combined contextindependent network generate contextdependent phone probability estimates experiments show average reduction word error rate system arpa word word tasks improved modelling decoding speed system fast system introduction hybrid system performed conventional hidden markov model systems arpa evaluations speech recognition systems hochberg renals robinson hybrid framework attractive compact fewer parameters conventional systems whilst providing tive powers connectionist architecture established phones vary occur phonetic contexts vowel lowing sound shortterm contextual influence mike hochberg communications avenue building menlo park contextdependent classes speech recognition system handled hmms creating model sufficiently differing phonetic acoustic evidence modelling phones phonetic contexts produces sharper probability density functions approach improves recognition accuracy equivalent contextindependent systems recurrent neural network model acoustic context internally state vector model phonetic context paper presents approach improving system phonetic contextdependent modelling cohen franco morgan rumelhart separate sets context dependent output layers model context effects states phone models networks discriminate phones broad class left contexts training time reduced multi layer perceptron changing hiddentooutput weights contextdependent training system performs darpa resource management task work presented schwartz makhoul similar work cohen contextdependent mixture experts system jordan jacobs based structure contextindependent built state training data 
divided parts left context separate model built context approach phonetic contextdependent modelling mlps posed bourlard morgan based conditional probability data terms phone data context data phone approach paper mixture work work recurrent work concentrates building compact system suited requirements result context training scheme fast implemented workstation parallel processing machine training overview hybrid system basic framework system similar bourlard morgan recurrent network acoustic model framework detailed description recurrent network phone probability estimation robinson time frame acoustic vector mapped output vector represents estimate posterior probability phone classes phone class time input time left past acoustic context modelled internally dimensional state vector storing information presented input future acoustic context posterior probability estimation frames input network network trained modified version error backpropagation time robinson decoding hybrid approach equivalent tional decoding difference models state observations typical systems recognition process expressed finding maximum posteriori state sequence utterance decoding criterion requires computation likelihood acoustic robinson hochberg data phone state sequence phones drops decoding process network outputs mapped scaled likelihoods priors estimated training data decoding decoder renals hochberg compute utterance model generated observed speech signal contextdependent probability estimation approach work augment similar morgan contextdependent likelihood factored context classes contextindependent phones substituting context independent probability density function term constant frames drops decoding process purposes format extremely appealing estimated training data needed estimate approach paper context experts modules phone class augment existing training state vector estimate obtained training recurrent network discriminate contexts phone class estimate posterior probability 
context class phone class training recurrent neural networks format expensive difficult recurrent format network discontinuities acoustic input vectors implies recurrent networks phone classes shown data assumption made state vector good representation singlelayer perceptron trained state vectors classify phonetic context classes finally likelihood estimates phonetic context class phone class decoding embedded training estimate parameters networks training data aligned viterbi segmentation context network trained nonoverlapping subset state vectors generated viterbi aligned training data context networks trained training procedure robinson figure phonetic contextdependent modular system phonetic context posterior probabilities required input decoder outputs context modules hand side figure posterior probabilities calculated numerator stage operates normal fashion generating posterior probabilities time modules state vector generated input order classify context class posterior probability outputs multiplied module outputs form contextdependent posterior probability estimates relationship mixture experts similarities mixture experts jordan jacobs training making soft split data mixture experts case viterbi segmentation selects expert exemplar means expert responsible data assumes viterbi segmentation good approximation process expert trained small subset training data avoiding computationally expensive requirement expert data decoding treated gating network smoothing predictions experts analogous manner standard mixture experts gating network description system hochberg robinson clustering context classes problems faced contextdependent system decide context classes included system method overcoming problem based approach cluster context classes guarantees full coverage phones context context classes chosen acoustic evidence tree clustering framework building small number contextdependent phones
keeping contextdependent connectionist system architecture compact tree building algorithm based young details found trees built training data pronunciation lexicon evaluation context system contextindependent networks trained arpa wall street jour corpus phonetic contextdependent classes clustered acoustic data decision tree algorithm running data recurrent network feedforward fashion obtain million frames dimensional state vectors approximately hours workstation training contextdependent networks training data takes hours total workstation contextdependent modules crossvalidated development word level results contextdependent systems compared contextindependent baseline shown table test crossvalidation development purposes contextdependent systems applied larger tasks recent european speech recognition evaluation word development evaluation sets american english contextdependent system extended include modules trained backwards time forward context augment merged contextindependent system hochberg renals robinson contextdependent classes speech recognition system table comparison system systems word language model tasks system system system test sets eval table comparison merged systems systems word tasks tests language model evaluation results test sets system system english english english english table comparison average utterance decode speed systems systems word tasks tests language model pruning levels speedup tests utterance utterance decode speed decode speed american english british english table number parameters systems compared systems system increase parameters parameters parameters american english british english similar system built british english table shows improve ment gained context models official entries evaluation figures represent lowest reported word error rate english tasks result improved phonetic modelling class discrimination search space reduced meant decoding speed fast contextdependent system table roughly times contextdependent 
phones compared increase number parameters introduction context models evaluation system shown table large increase number system parameters order magnitude equivalent system built task robinson hochberg conclusions paper discussed successful integrating phonetic contextdependent classes current hybrid system architecture modular approach augment current hybrid system fast training contextdependent modules achieved training corpus hours utterance decoding performed standard decoder word error significantly reduced whilst decoding speed context system fast baseline system word tasks references bourlard morgan continuous speech recognition connection statistical methods ieee transactions neural networks bourlard morgan connectionist speech recognition hybrid approach kluwer publishers cohen franco morgan rumelhart context dependent multiple distribution phonetic modeling mlps hochberg renals robinson connectionist model combination large vocabulary speech recognition networks signal processing hochberg renals robinson hybrid recognition spoken language systems technology workshop arpa jordan jacobs hierarchical mixtures experts algorithm neural computation hochberg robinson incorporating context dependent classes hybrid recurrent speech recognition system cambridge university engineering department automatic speech recognition development system kluwer publishers renals hochberg efficient search posterior phone proba bility estimates icassp robinson application recurrent nets phone probability ieee transactions neural networks young state high acoustic modelling spoken language systems technology workshop schwartz makhoul hierarchical mixtures experts methodology applied continuous speech recognition nips
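The modular combination described above multiplies each contextindependent phone posterior by the outputs of that phone's context module, then divides by priors estimated from training data to obtain scaled likelihoods for the decoder. A minimal sketch of that arithmetic follows, assuming the factorisation P(c, q | x) = P(c | q, x) P(q | x). The function names and the dictionary layout are hypothetical, chosen only for illustration.

```python
def combine_context_posteriors(phone_post, context_post):
    """Joint posterior P(c, q | x) = P(c | q, x) * P(q | x).

    phone_post:   dict phone -> P(q | x) from the context-independent network
    context_post: dict phone -> list of P(c | q, x) from that phone's module
    """
    return {
        phone: [p_q * p_c for p_c in context_post[phone]]
        for phone, p_q in phone_post.items()
    }


def scaled_likelihoods(joint_post, joint_prior):
    """Divide each joint posterior by its prior P(c, q).

    Assumption: the priors are relative frequencies counted on the
    Viterbi-aligned training data, with the same layout as joint_post.
    """
    return {
        phone: [p / prior
                for p, prior in zip(joint_post[phone], joint_prior[phone])]
        for phone in joint_post
    }
```

Because each context module only rescales its own phone's posterior, the joint estimates still sum to one over all phone and context classes whenever both sets of inputs are normalised.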
10 annealed selforganizing source channel coding klaus obermayer department computer science technical university berlin berlin germany abstract derive analyse robust optimization schemes noisy vector quantization basis deterministic annealing starting cost function central clustering incorporates distortions channel noise develop soft topographic vector quantization gorithm stvq based maximum entropy principle performs maximumlikelihood estimate expectation maximization fashion annealing temperature leads phase transitions existing code vector representation cooling process calculate critical temperatures modes function eigenvectors eigenvalues covariance matrix data transition matrix channel noise family vector quantization algorithms derived stvq deterministic annealing scheme kohonens selforganizing algorithm call ssom applied vector quantization image data noisy binary symmetric channel algorithms performance compared stvq naturally superior account channel noise results compare stvq computationally demanding introduction noisy vector quantization important coding scheme data transmitted noisy communication lines suited speech image data transmitted noise level condi tions idea jointly optimizing codebook data representation channel noise apply deter annealing scheme rose buhmann problem develop annealed selforganizing source channel coding soft topographic vector quantization algorithm stvq heskes miller stvq derive class vector quantization algorithms find ssom deterministic annealing variant kohonens selforganizing kohonen approximation ssom minimize energy function computationally demanding stvq annealing scheme enables neighborhood function solely encode desired transition probabilities channel noise opens possibilities usage arbitrary neighborhood functions analyse phase transitions annealing demonstrate performance ssom applying image data compression transmission noisy channels derivation class vector vector quantization method encoding data grouping 
data vectors representative data space group data vectors objective vector quantization find code vectors binary assignment variables cost function minimized denotes cost assigning data point code vector idea case code labels form compressed encoding data purpose transmission noisy channel figure distortion caused channel noise modeled matrix tran sition probabilities noise induced change assignment data vector code vector code vector transmission received index decoded code vector averaging squared euclidean distance transitions yields assignment costs factor introduced computational convenience starting cost function obtained principle maxi entropy constraint average cost lagrangian multiplier interpreted inverse temperature determines assignments order generalize training code vectors probability distribution legal sets assignments obtain assignment probability data vector code vector solving fixedpoint iteration comprises expectationmaximization algorithm estep obermayer figure generic data problem encoder assigns input vectors labeled code vectors indices transmit noisy channel charac transition probabilities decoder expands received code vector repre data vectors assigned encoding total error measured squared euclidean distance original data vector repre averaged transitions distortion channel noise determines assignment probabilities data points code vectors mstep determines code vectors assignment probabilities order find global minimum increased annealing schedule tracks solution easily solvable convex problem exact solution call solution soft topographic vector quantizer stvq starting point class vector quantization algorithms approximation applied leads soft version ssom additionally applied rose recovered leads hard versions topographic vector selforganizing kohonen focus soft selforganizing ssom ssom computationally demanding stvq offers contrast traditional robust deterministic annealing optimization scheme tend approach arbitrary nontrivial 
neighborhood functions required source channel coding problems noisy channels phase transitions annealing rose annealing representation data code vectors split size codebook number code vectors split point permutation symmetry broken splitting behavior code vectors infinite temperature data vector assigned code vector equal probability size codebook code vectors located center mass data expanding order fixed point assuming obtain critical annealed selforganizing source channel coding stvq step ssom step step step figure class vector derived stvq approximations limits text front stands soft probabilistic approach inverse temperature center mass solution unstable largest eigenvalue covariance matrix data corre sponds variance principal axis asso eigenvector code vectors split largest eigenvalue matrix elements component eigenvector determines code vector direc moves relative code tion principal axis vectors ssom similar result obtained simply replaced ssom details numerical results binary symmetric channel error rate assuming length code indices bits matrix elements tion matrix binary representations problem numerical analysis phase transitions previous section formed data consisting data vectors drawn twodimensional elongated gaussian distribution diagonal covariance matrix size codebook bits figure left shows positions code vectors data space functions inverse critical inverse code vectors split xaxis principal axis distribution data points accordance eigenvector largest eigen matrix code vectors hamming distance move opposite positions principal axis remain center note eigenvalues matrix figure shows critical inverse temperature function stvq crosses ssom dots results good agreement theoretical predictions solid line inset displays average cost function obermayer stvq ssom drop average cost occurs critical inverse figure phase transitions problem left code vectors ssom case plotted inverse splitting code vectors occurs good accordance theory critical values ssom 
dots stvq crosses determined average cost inset line stvq phase transition solid lines denote theoretical predictions convergence parameter fixedpoint iteration giving upper limit difference successive code vector positions dimension source channel coding image data order demonstrate applicability stvq ssom source channel coding employed algorithms compression image data noisy channel decoded transmission training pixel images scenes size codebook chosen order achieve applied exponential annealing schedule determined start split note transition matrix optimization responds embedding dimensional hypercube dimensional data space tested resulting encoding test image figure determining codebook simulating transmission indices noisy binary symmetric channel error rate reconstructing image codebook results summarized figure shows plot function rate stvq dots ssom vertical crosses crosses stvq shows performance high naturally superior account channel noise ssom performs slightly worse approx stvq fact ssom computationally demanding stvq story found httpwww annealed selforganizing source channel coding encoding convolution demonstrates efficiency ssom source channel coding figure shows generalization behavior ssom codebook optimized codebook optimized performs worse trained ssom values performs values values trained noisy case performed robustness channel noise achieved expense optimal data representation noise free case figure finally performance vector reconstruction stvq slightly ssom superior reconstruction figure comparison differ vector image noisy channel sion reconstruction plot shows defined func tion rate stvq ssom optimized channel noise ssom optimized train consisted pixel images codebook size anneal schedule test image vergence parameter conclusion presented algorithm noisy vector quantization based deterministic annealing stvq phase transitions annealing process analysed class vector derived standard algorithms soft versions special cases stvq fuzzy version 
kohonens introduced computationally efficient stvq yields good results demonstrated noisy vector quantization image data annealing scheme opens possibilities usage neighborhood function represents nontrivial neighborhood relations acknowledgements work supported berlin advice regard image processing references buhmann hofmann robust vector quantization competitive learning proceedings icassp munich study vector quantization noisy channels ieee transactions information theory obermayer original stvq ssom figure transmitted binary symmetric channel encoded reconstructed vector quantization algorithms obermayer phase transitions stochastic selforganizing maps physical review heskes kappen selforganizing nonparametric regression artificial neural networks kohonen springerverlag derivation principles class learning algorithms proceedings washington analysis selforganizing maps neural computation miller rose combined vector quantization annealing ieee transactions communications rose statistical mechanics phase transitions clustering physical review letters
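The two ingredients of the STVQ iteration described above, the transition matrix of a binary symmetric channel and the alternating E and M steps at a fixed inverse temperature, can be sketched as follows. This is a hedged illustration, not the authors' implementation: the assignment cost is assumed to be half the squared Euclidean distance averaged over channel transitions, the transition probability between two indices is assumed to depend only on their Hamming distance, and the names `bsc_transition_matrix` and `stvq_step` are hypothetical.

```python
import math


def bsc_transition_matrix(n_bits, error_rate):
    """h[r][s]: probability that index r is received as s on a binary
    symmetric channel that flips each bit independently."""
    size = 1 << n_bits
    h = []
    for r in range(size):
        row = []
        for s in range(size):
            d = bin(r ^ s).count("1")       # Hamming distance
            row.append(error_rate ** d * (1.0 - error_rate) ** (n_bits - d))
        h.append(row)
    return h


def stvq_step(data, codebook, h, beta):
    """One E-step and one M-step at inverse temperature beta."""
    m = len(codebook)
    dim = len(data[0])

    # E-step: assignment probabilities from transition-averaged costs.
    probs = []
    for x in data:
        d2 = [sum((xi - wi) ** 2 for xi, wi in zip(x, w)) for w in codebook]
        cost = [0.5 * sum(h[r][s] * d2[s] for s in range(m))
                for r in range(m)]
        c_min = min(cost)                   # shift for numerical stability
        weights = [math.exp(-beta * (c - c_min)) for c in cost]
        z = sum(weights)
        probs.append([w / z for w in weights])

    # M-step: code vectors as averages weighted by the probability of
    # being received as index r.
    new_codebook = []
    for r in range(m):
        num = [0.0] * dim
        den = 0.0
        for x, p in zip(data, probs):
            g = sum(h[s][r] * p[s] for s in range(m))
            den += g
            for k in range(dim):
                num[k] += g * x[k]
        new_codebook.append([v / den for v in num] if den > 0.0
                            else list(codebook[r]))
    return probs, new_codebook
```

At `beta = 0` every data vector is assigned to every code vector with equal probability and all code vectors move to the centre of mass of the data, which matches the infinite-temperature behaviour described in the text; annealing then repeats `stvq_step` to convergence while `beta` is increased.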
1 skeletonization technique trimming network relevance assessment michael mozer paul smolensky department computer science institute cognitive science university colorado boulder abstract paper proposes means knowledge network determine functionality relevance individual units purpose understanding networks behavior improving performance basic idea iteratively train network tain performance criterion compute measure relevance input hidden units critical performance automatically relevant units skeletonization tech nique simplify networks eliminating units redundant information improve learning performance learning hidden units trimming unnecessary constraining generalization understand behavior networks terms minimal rules introduction thing connectionist networks common brains open inside internal organization number units connections techniques hierarchical cluster analysis sejnowski rosenberg suggested step understanding network behavior handle role individual units play paper proposes means knowledge work determine functionality relevance individual units measure relevance unit relevant units automatically network construct skeleton version network skeleton networks potential applications constraining generalization eliminating input hidden units serve pose number parameters network reduced generalization constrained improved learning learning fast hidden units large number hidden units generalizations learning slower mozer smolensky hidden units generalization idea learning train network hidden units eliminate irrelevant lead rapid learning training gradually improvement generalization performance understanding behavior network terms rules wishes handle behavior network analyzing network terms small number rules enormous number parameters situations prefer simple network performed correctly cases plex network performed correctly skeletonization process cover simplified network researchers chauvin hanson pratt david rumelhart personal communication studied 
techniques closely related problem number free parameters back propagation networks approach involves adding extra cost terms usual error function weights units decay approach removal units gradient descent procedure motivation approach twofold initial interest designing procedure serve focus attention important units explicit relevance metric needed matter balance mary secondary error term determine relative weighting terms adjusted learning experience impossible avoid local minima compromise solutions partially satisfy error terms conclusion supported experiments hanson pratt determining relevance unit multilayer feedforward network determine unit serves important function network obvious source information outgoing connections unit layer connections expect activity impact higher layers effects connections cancel large input units influence units saturation outgoing connections units small unit constant activity case replaced bias units accurate measure relevance unit needed happen performance network unit removed network unit versus unit straightforward measure relevance unit unit error network training problem measure compute error unit removed complete pass made training cost computing stimulus presentations number units network number patterns skeletonization training training fixed additional difficulties arise computing find good approximation presenting approxi mation introduce additional notation suppose asso unit coefficient represents attentional strength unit figure coefficient thought flow activity unit activity unit connection strength squashing function unit influence rest network unit conventional unit terms relevance unit rewritten approximate derivative error respect assuming equality holds approximately approximation figure network coefficients input hidden units derivative computed error propagation procedure similar adjusting weights back propagation additionally note approximation assumes changed actual parameters system notational 
convenience mozer smolensky estimating relevance practice found strongly time stable esti mate yields results time average derivative simulations reported measure detail relevance assessment mention relevance puted based linear error function index terns output units target output actual output usual error function poor estimate relevance pattern close target difficulty mozer smolensky results reported error metric training weights conventional back propagation measured involves separate back propagation phases computing weight updates relevance measures simple salience problem network inputs labeled hidden unit output generated training patterns correlations input unit output shown table task hidden layer inclusion hidden unit simply allowed standard threelayer architecture tasks subsequent simulations unit activities range input target output patterns binary vectors training continues output activities acceptable margin target additional details training procedure network parameters mozer smolensky perform network attend input inputhidden connections weights qualitative correlations table contrast relevance values input table input unit correlation output unit inputhidden connection strengths values reported table average simulation initial weights averaging signs weights connection negative skeletonization units show highly relevant negligible relevance qualitative picture presented profile identical weights reflect statistics training func units problem network binary inputs labeled binary output task learn function output unit special case inputs hidden units back propagation arrives solution unit responds rule exception unit relevant solution accounts cases unit accounts fact reflected replications simulation values extremely reliable standard errors relevance measured quadratic error function metric unit incorrectly judged relevant unit mentioned basis failure quadratic error function true relevance output error exception pattern learned error patterns 
significantly lower relevance values computed basis patterns smaller puted basis exception pattern results relevance assessment derived exception pattern relevance incorrect relevance assignments problem avoided assessing relevance linear error function attempted network eliminating hidden units logical candidate relevant unit trimming process leave simpler network skeleton network behavior characterized terms simple rule account input cases constructing skeleton networks remaining examples construct skeleton networks relevance metric procedure train network output unit activities margin target details mozer compute unit remove unit smallest repeat steps number times examples chosen input units hidden units simultaneously reason addressed crucial question work present advance stop trimming makes values mozer smolensky source information informative magnitudes large increase minimum trimming progresses ming performance network train problem task determining rule east trains west trains figure simple rules simple sense rules require minimal number input features east trains long triangle load open white features describe train essential making discrimination network trained task back propagation learns quickly final solution takes consideration inputs features tially correlated discrimination skeletonization procedure applied number inputs network successfully minimal input features long triangle load open white replications simulation trimming task trivial expected success rate random removal inputs skeletonization procedures experimented resulted success rates problem network learns behave task binary inputs labeled binary output inputs output values logical function computed east west figure train problem adapted skeletonization table median epochs median epochs architecture failure rate criterion criterion hidden hidden standard skeleton standard unit back network tested work began hidden units initially skeleton work network reach performance criterion training 
epochs assumed network stuck local minimum counted failure performance statistics networks shown table averaged standard network fails reach criterion runs skeleton network obtains solution hidden units solution lost hidden layer units skeleton network hidden units reaches criterion half number training epochs required standard network point hidden units time skeleton work network retrained criterion nonetheless total number epochs required train initial hidden unit network required standard network units hidden units performance skeleton network remains close improvement learning substantial random mapping problem problem random input vectors random element output vectors twenty random inputoutput pairs training training sets generated tested standard unit network tested skeleton network training architecture simulation criterion reached training epochs assumed network stuck local minimum counted failure table shows standard network failed reach criterion hidden units runs skeleton network failed hidden layer units runs training sets failure rate network lower standard network networks required amounts training reach criterion hidden units skeleton work reaches criterion hidden units performance significantly network results parallel report median epochs criterion epochs avoid caused large number epochs failure runs mozer smolensky table standard network skeleton network median epochs median epochs median epochs training failures criterion failures criterion criterion hidden hidden hidden summary conclusions proposed method knowledge network determine relevance individual units relevance metric identify input hidden units critical performance network relevant units construct skeleton version network skeleton networks application scenarios simulations demon strated understanding behavior network terms rules salience problem relevance metric input sufficient solve problem inputs conveyed redundant information problem relevance metric distinguish hidden unit 
responsible correctly handling cases general rule hidden unit dealt case train problem relevance metric correctly discovered minimal input features required describe category improving learning performance standard network unable discover solution skeleton network failed skeleton network learned training quickly random mapping problem problem skeleton work succeeded considerably comparable learning speed training required reach criterion initially basically skeletonization technique network input hidden units learn training examples rapidly gradually units discover concise characterization underlying regularities task process local minima avoided increasing learning time skeletonization surprising result ease network recover unit removed conventional network excess units training making hidden units simulations network solution hidden units training removal hidden unit drop performance criterion case appears easy path solution units solution fewer presented skeletonization technique trimming units work reason similar procedure operate individual connec tions basically coefficient required connection computation yann personal communication independently developed procedure similar skeletonization technique operates individual connections acknowledgements conversations work dave goldberg geoff hinton yann feedback eric saving computer work supported grant sloan tion geoffrey hinton grant james mcdonnell foundation michael mozer sloan foundation grant grants paul smolensky references chauvin backpropagation algorithm optimal hidden units advances neural network information processing systems mateo morgan kaufmann hanson pratt comparisons constraints minimal network construction back propagation advances neural network information processing systems mateo morgan kaufmann constraints preferences inductive learning experimental study human machine performance cognitive science mozer smolensky skeletonization technique trimming network relevance assessment technical 
report boulder university colorado department computer science sejnowski rosenberg parallel networks learn english text complex systems
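The relevance measure and trimming loop described in the paper above can be sketched on a toy single-layer network. The sketch assumes a relevance defined as the error without a unit minus the error with it, an attentional coefficient per input unit that gates its activity, and an approximation given by the negative derivative of the linear error with respect to that coefficient, evaluated at one. The derivative is taken here by finite differences rather than by back propagation, and the weights and patterns are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def linear_error(alphas, weights, patterns):
    """Sum over patterns of |target - output|; alphas gate each input unit."""
    total = 0.0
    for inputs, target in patterns:
        net = sum(a * w * x for a, w, x in zip(alphas, weights, inputs))
        total += abs(target - sigmoid(net))
    return total

def exact_relevance(i, weights, patterns):
    """Error with unit i removed (alpha = 0) minus error with it present."""
    present = [1.0] * len(weights)
    removed = list(present)
    removed[i] = 0.0
    return (linear_error(removed, weights, patterns)
            - linear_error(present, weights, patterns))

def approx_relevance(i, weights, patterns, h=1e-5):
    """Negative derivative of the error w.r.t. alpha_i at alpha_i = 1,
    estimated by a central finite difference (an assumption of this sketch)."""
    up = [1.0] * len(weights)
    down = [1.0] * len(weights)
    up[i] += h
    down[i] -= h
    slope = (linear_error(up, weights, patterns)
             - linear_error(down, weights, patterns)) / (2 * h)
    return -slope

# Hypothetical trained weights: units 0 and 2 matter, unit 1 barely does.
weights = [4.0, 0.1, -3.0]
patterns = [((1, 0, 0), 1), ((0, 0, 1), 0), ((1, 1, 0), 1), ((0, 1, 1), 0)]

units = range(len(weights))
least_exact = min(units, key=lambda i: exact_relevance(i, weights, patterns))
least_approx = min(units, key=lambda i: approx_relevance(i, weights, patterns))
```

In a full skeletonization run the unit selected this way would be removed, the network retrained to criterion, and the step repeated. The derivative estimate avoids one extra pass through the training set per unit, at the cost of being only a local approximation.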
10 canonical distortion measure feature space classification jonathan peter bartlett department systems engineering australian national university canberra australia abstract prove canonical distortion measure optimal distance measure classifi cation show reduces squared euclidean distance feature space function classes expressed linear combinations fixed features bounds sample complexity required learn experiment presented neural network learnt japanese environ ment classification introduction input space distribution class functions mapping called environment distribution function canonical distortion measure inputs defined paper realvalued functions squared loss introduced analysed primarily vector quantization perspective proved optimal distortion measure vector quantization sense producing approximations functions environment experimental results presented domain showing learnt purpose paper investigate utility classification tool section show class functions common feature author supported part epsrc grants baxter bartlett reduces change variables squared euclidean distance feature space lemma showing optimal distance measure classification functions common feature optimal classification achieved squared euclidean distance feature space general unknown section present technique learning minimizing squared loss give bounds quired good generalisation section present experimental results features learnt japanese environment squared euclidean distance classification feature space exper provide strong empirical support theoretical results difficult realworld application feature space suppose expressed linear combination fixed features exists case distribution environment distribution weight vectors measuring distance function values matrix making change variable assumption functions environment expressed linear combinations fixed features means simply squared euclidean distance feature space related original linear transformation classification suppose environment 
consists classifiers functions function training examples classification classification computed classification classification nearest training point distance measure chosen random expected misclassification error scheme training points nearest neighbour lemma definitions lemma sequences remarks lemma combined results section shows function classes common feature optimal classification achieved squared euclidean distance feature space section experimental results japanese presented supporting conclusion property optimality classification stable small perturbations learn approximation canonical distortion measure feature space classification small case classification small show stability maintained classifier environments positive examples functions overlap significantly case japanese environment section face recognition environments speech recogni tion environments investigating general conditions stability maintained learning environments encountered practice speech recognition image recogni tion unknown section shown estimated learnt function approximation techniques feedforward neural networks sampling environment learn learner provided class functions neural networks maps goal learner find error small sake argument error measured expected squared loss expectation respect learner provided training data form data minimize empirical version unknown generate data form estimated training pair generate training sets learning distribution environment distribution input space sampled samples samples pair estimate training triples data generate empirical estimate training triples functions assumed symmetric satisfy case experiment presented neural network class minimized directly gradient descent section present alternative technique features learnt environment estimate feature space constructed explicitly baxter bartlett uniform convergence ensure good generalisation minimizing sense small theorem shows occurs number functions number input samples sufficiently large 
nonetheless benign restrictions statement theorem state theorem denotes smallest norm exists theorem assume range functions environment class approximate proof define triangle inequality hold treat separately equation simplify notation denote canonical distortion measure feature space classification defined bound theorem equation loss suppose trick split pairs double exist broken proven tion conditional pairs results realvalued function learning squared loss union bound setting statement theorem ensures remark bound number functions sampled environment independent complexity class related bias learning equivalently learning learn results number functions depend complexity heuristic explanation learning distance function input space bias learning learning entire hypothesis space environment section classes problems learn functions environment cases learning effective method learning learn experiment japanese verify optimality classification show learnt nontrivial domain baxter bartlett learnt japanese environment specifically func tions environment classifier kanji character database segmented kanji characters scanned sources group state university york quality images ranged clean degraded main reason choosing japanese english testbed large number distinct characters japanese recall theorem good generalisation learnt sufficiently functions sampled environment environment consisted english characters sufficiently characters characters impossible test learnt characters training learning directly minimizing learnt implicitly learning neural network features functions environment features learnt method outlined essentially involves learning classifiers common final hidden layer features learnt classifiers environment data training testing resulting classifier linear combination neural network features average error classifiers test accurate estimate test examples recall section expressed fixed feature reduces result learning procedure features weight vectors character 
classifiers training empirical estimate true variable classification test examples linear change experiments experiment testing training examples training characters extra category purpose classification test examples label nearest neighbour training initially training examples mapped feature space give test mapped feature space assigned label total misclassification error directly compared misclassification error original clas training data explicitly information stored network make comparison classifiers information network learnt classification improvement error classifier error classifier indication optimal distortion measure classification experiment classification test time characters distinguished case learnt asked distinguish characters treated single character trained misclassification error surprisingly error compares error achieved data group carefully selected feature routine case distance measure learnt input subject optimization canonical distortion measure feature space classification figure kanji characters character examples nearest neighbours remaining characters final qualitative assessment learnt compute distance pair testing examples distance pair characters individual character represented number testing examples computed averaging distances constituent examples neighbours character calculated measure character turned nearest neighbour cases neighbours strong subjective similarity original representative examples shown figure conclusion shown canonical distortion measure optimal distortion measure classification environments functions expressed linear combination fixed features canonical distortion measure squared euclidean distance feature space technique learning presented bounds sample complexity required good proved experimental results presented japanese environment learnt learning common features subset character classifiers environment learnt distance measure neighbour classification performed remarkably characters train characters
references jonathan baxter learning internal representations proceedings eighth international conference computational learning theory pages press jonathan baxter canonical metric vector technical report technical report royal college university london july jonathan baxter canonical distortion measure vector quantization function approximation proceedings international conference machine learning july bartlett williamson efficient agnostic learning neural networks bounded fanin ieee transactions information theory hong system reading handwritten page images symposium document image understanding technology
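The claim in the paper above, that the canonical distortion measure reduces to a weighted squared euclidean distance in feature space when every function in the environment is a linear combination of fixed features, can be checked numerically. The sketch assumes the measure is the expected squared difference of function values over the environment, so that for linear functions it equals a quadratic form in the feature difference with the second-moment matrix of the weight vectors. The feature map, weight samples, and training points are hypothetical.

```python
def features(x):
    """Hypothetical fixed feature map shared by all functions."""
    return [x, x * x]

# Hypothetical sample of weight vectors standing in for the environment.
WEIGHT_SAMPLES = [[1.0, 0.0], [0.0, 2.0], [1.0, -1.0]]

def evaluate(w, x):
    return sum(wi * pi for wi, pi in zip(w, features(x)))

def cdm_direct(x, y):
    """Empirical mean over sampled functions of (f(x) - f(y))^2."""
    return sum((evaluate(w, x) - evaluate(w, y)) ** 2
               for w in WEIGHT_SAMPLES) / len(WEIGHT_SAMPLES)

def second_moment():
    """W[j][k] = empirical mean of w_j * w_k over the weight samples."""
    k = len(WEIGHT_SAMPLES[0])
    return [[sum(w[a] * w[b] for w in WEIGHT_SAMPLES) / len(WEIGHT_SAMPLES)
             for b in range(k)] for a in range(k)]

def cdm_quadratic(x, y):
    """(phi(x) - phi(y))^T W (phi(x) - phi(y)), distance in feature space."""
    W = second_moment()
    d = [a - b for a, b in zip(features(x), features(y))]
    return sum(d[a] * W[a][b] * d[b]
               for a in range(len(d)) for b in range(len(d)))

def nearest_label(x, training):
    """1-nearest-neighbour classification under the distortion measure."""
    return min(training, key=lambda pair: cdm_direct(x, pair[0]))[1]

training = [(0.0, "a"), (3.0, "b")]
label = nearest_label(0.4, training)
```

The two functions agree up to rounding, which is the change of variables argument in miniature: rescaling the features by a square root of the second-moment matrix turns the quadratic form into a plain squared euclidean distance.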
11 making templates invariant application rotated digit recognition baluja pittsburgh research center school computer science carnegie mellon university abstract paper describes simple efficient method make object classification invariant rotations task divided parts orientation discrimination classification idea orientation discrimination classification turn input image belongs class interest image rotated maximize similarity train images class prototype object upright tation process yields images object upright position resulting images classified models trained upright examples approach successfully applied realworld tasks rotated handwritten digit recognition rotated face detection scenes introduction rotated text commonly variety situations ranging official exam figure recognize digits characters regard rotation figure text axis aligned include baluja focus paper recognition rotated digits simplest method system recognize digits rotated employ existing systems designed upright digit recognition repeatedly rotating input image small increments applying recognition rotation digit eventually recognized discussed paper extremely computationally expensive approach error prone classification digit occur orientations likeli hood incorrect match high procedure presented paper make templates invariant faster accurate detailed descriptions procedure section section demonstrates applicability approach realworld task rotated handwritten digit recognition section paper conclusions suggestions future research briefly describes application method successfully applied face detection scenes making templates invariant process make templates invariant describe context binary classification problem extension multiple classes discussed section imagine simplified version digit recognition task detector single digit suppose input digit challenge rotated image plane arbitrary amount recognizing rotated objects step process step work applied input image network input detection network 
input network returns digits angle rotation window rotated negative angle make upright note network require input image encountered network return unspecified tion rotation yield image resulting image detection network detect hand rotated detected detection network rotated upright position network subsequently detected detection network detection network trained output positive input upright negative rotated noted methods require neural networks shown number classifiers detection networks sequentially input image processed network returns angle rotation assuming image simple geometric transformation image performed rotation original image contained upright resulting image passed detection network original image contained successfully detected idea easily extended classification problems network trained object class recognized digit recognition prob networks trained digits classify digits upright single classification network outputs detection networks trained individual digits alternative approaches paper classification network standard manner output maximum classification classify image procedure making templates invariant digit pass image returns rotation angle rotate image returned rotation angle pass image classification network classification networks maximum output output activation output recorded digit eliminated candidate cases eliminate candidates cases candidate remain cases digit maximum recorded activation step returned event candidates remain system reject sample classify return maximum recorded step examples rejected network train networks images rotated digits input rotation angle target output examples rotated digits shown figure image pixels upright data sets database figure examples digits recognized group shown rotation appears data eighth examples show digit rotated random amounts classification network output represents distinct class stan dard output representation outputs represent continuous variable angle rotation outputs network gaus sian 
output encoding pomerleau output units gaussian training network activate single output encoding outputs close desired output activated proportion tance desired output representation avoids imposed discontinuities strict encoding images similar slight differences rotations representation finer granularity number output units encoding pomerleau network architecture classification networks consists single hidden layer unlike standard fullyconnected network hidden unit connected small patch input networks groups hidden units hidden unit connected patches inputs groups patches spaced groups overlapping patches similar networks baluja face detection unlike convolution networks weights hidden units shared note local receptive field configurations equivalent performance baluja rotated handwritten digit recognition create complete invariant digit recognition system step ment digit background recognize digit segmented systems proposed segmenting written digits back ground clutter jain kanade paper concentrate recognition portion task segmented image potentially rotated digit recognize digit experiment conducted establish baseline performance standard upright training train classification network training consists digits network tested testing testing digits addition measuring performance upright testing entire testing rotated expected performance rapidly degrades tion graph performance respect rotation angle shown figure figure performance classification network trained upright images tested rotated images angle rotation increases performance degrades note spike degrees digits peak performance approximately digits interesting note rotation performance slightly rises digits symmetric center horizontal axis digits recognized orienta tions upright detector works digits mentioned earlier simplest method make upright digit classifier handle tions repeatedly rotate input image classify rotation draw back approach severe computational expense drawback digit examined rotations 
similar numerous digits orientations approach avoid problem classify digit examined rotations ensure process biased size increments image rotated angle increments shown table method yields table exhaustive search rotations number angle exhaustive search method frequent vote rotations frequent vote counted votes positive rotations note empirical comparisons presented convolution networks performed extremely upright digit recognition task limited computation resources unable train networks takes days train network trained hours approximately misclassification rate upright test networks reported error noted networks trained study easily conjunction classification procedure including convolutional networks templates invariant classification accuracies reason vote counted clas network predicts outputs network trained predict digit recognized experiment repeated modification vote counted maximum output classification network result shown table classification rate improved baseline performance measures quantitative measurements compare effectiveness approach paper formance procedure networks single classi fication network shown figure note unlike graph shown figure effect classification performance rotation angle figure performance combined rotation network classification network system proposed paper note performance largely unaffected rotation average performance rotations provide intuition networks perform figure shows examples networks transform digit work suggests rotation makes digit network trained suggest rotation make input digit digit effect digit original digit digit rotated digit rotated digit rotated digit rotated digit rotated digit rotated digit rotated digit rotated digit rotated digit rotated figure digits rotated angles networks expected method working digits diagonal upper left bottom upright approach train single network handle rotation classification rotated digits inputs digits classification target output experiments approach yielded results 
techniques presented baluja shown figure average classification accuracy approximately performance good upright case peak performance approximately figure high level performance achieved upright case rotated digits rotations admissible characters ambiguous problem working correctly suggest angle rotation make input image digit rotation cases input image digit rotation image cases shown figure digit transformed classification error errors instances hope correcting figure presents complete confusion matrix examples figure digit rotated similar nonetheless remain distinctive features real differentiated rotated classification network unable make distinctions trained examples remember classification work trained upright digit training rotated training reflects fundamental discrepancy procedure distributions images train classification network distributions network tested address problem classification mechanism modified single neural network classifier previously individual detection networks detection network single binary output input digit upright network trained network paired respective detection network crucial point training original upright images training image positive negative passed makes training difficult digits rotated distribution training images matches testing distribution closely image presented passed network pairs candidate digits eliminated binary output detection network signal detection preliminary results approach extremely promising classification accuracy increases dramatically averaged tions reduction error previously approach predicted digit original image image rotated mistake digit rotated correct digit figure errors confusion matrix entries account entries filled ease reading errors made classification examples errors shown making templates invariant conclusions future work paper presented results difficult problem rotated digit recognition presented baseline results naive approaches checking rotations approaches slow large error rates 
sented results twostage approach faster effective naive approaches finally presented preliminary results approach closely models training testing distributions recently applied techniques presented paper detection faces scenes previous studies presented methods finding upright frontal faces rowley techniques presented detect frontal faces including rotated image plane methods presented paper directly applicable full alphabet rotated character recognition paper examined digit individually straightforward method eliminate ambiguities similar digits contextual informa tion surrounding digits rotated amount strong hints rotation nearby digits realworld cases expect digits close upright method incorporating information penalize matches rely large rotation angles paper presented general make recognition rotation invariant study rotation estimation procedures recognition templates implemented nonetheless classification technique implements form templates correlation templates support vector machines probabilistic networks knearest neighbor principal methods easily employed acknowledgements author reviews successive paper references baluja face detection rotation early concepts preliminary results pittsburgh research center technical report guyon personnaz dreyfus denker lecun comparing neural architectures classifying handwritten digits ijcnn jain automatic text location images video frames jackel bottou cortes denker drucker guyon miller simard vapnik learning algorithms classification comparison handwritten digit recognition neural networks statistical mechanics perspective lecun jackel bottou cortes denker drucker guyon muller simard vapnik comparison learning algorithms handwritten digit recog nition lecun boser denker henderson howard hubbard jackel handwritten digit recognition backpropagation network advances neural information processing systems nips touretzky david morgan kaufman handwritten digit recognition neural networks computation pomerleau mobile robot 
guidance kluwer academic rowley baluja kanade neural face detection pattern analysis machine intelligence pami january rowley baluja kanade rotation invariant neural face detection proceedings computer vision pattern recognition sato kanade hughes smith video digital news international workshop access image video databases kanade association face video proceedings ference computer vision pattern recognition
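The detection-based classification scheme described in the rotated digit paper above (each digit class has its own derotation step paired with a binary "is this an upright digit of my class" detector, and candidates are eliminated by the detector output) can be sketched roughly as follows. This is a minimal illustration, not the author's implementation: the names `classify_by_detection` and `make_toy_pairs`, the 0.5 threshold, and the rule of keeping the highest-scoring surviving candidate are all assumptions, and the toy "networks" are stand-in functions rather than trained models.

```python
from typing import Any, Callable, Dict, Optional, Tuple

# Hypothetical types: one derotation step and one upright-digit detector per class.
Derotate = Callable[[Any], Any]
Detect = Callable[[Any], float]


def classify_by_detection(
    image: Any,
    pairs: Dict[int, Tuple[Derotate, Detect]],
    threshold: float = 0.5,  # assumed cut-off for the binary detector output
) -> Optional[int]:
    """Pass the image through every (derotate, detect) pair.

    A candidate digit is eliminated when its detector does not fire.
    Among the survivors, the highest detector score wins (an assumption;
    the source does not say how ties between survivors are resolved).
    """
    best_digit: Optional[int] = None
    best_score = threshold
    for digit, (derotate, detect) in pairs.items():
        upright = derotate(image)   # undo the rotation estimated for this class
        score = detect(upright)     # detector output, assumed to lie in [0, 1]
        if score > best_score:
            best_digit, best_score = digit, score
    return best_digit


def make_toy_pairs() -> Dict[int, Tuple[Derotate, Detect]]:
    """Toy stand-ins: an 'image' is just a (label, angle) tuple."""

    def derotate(image: Any) -> Any:
        # Pretend the rotation estimate is perfect and reset the angle to 0.
        return (image[0], 0)

    def make_detect(digit: int) -> Detect:
        # Fires only on an upright image of its own digit.
        return lambda img: 0.9 if img == (digit, 0) else 0.1

    return {d: (derotate, make_detect(d)) for d in range(10)}
```

With the toy pairs, `classify_by_detection((3, 140), make_toy_pairs())` selects digit 3, and an input that no detector accepts yields `None`.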
12 mixture density estimation jonathan department statistics yale university haven yale andrew barron department statistics yale university haven andrew abstract gaussian mixtures socalled radial basis function networks density estimation provide natural counterpart sigmoidal networks function fitting approximation cases give simple expressions iterative improve ment performance components network introduced time mixture density estimation show mixture estimated maximum likelihood iterative likelihood improvement introduce achieves loglikelihood order loglikelihood achievable convex combination consequences approximation kullbackleibler risk minimum description length principle selects optimal number compo nents minimizes risk bound introduction density estimation gaussian mixtures provide representations densities model heterogeneous data high dimensions introduce index regularity density functions respect mixtures densities family mixture models components shown achieve kullbackleibler approximation error bounded manner analogous treatment sinusoidal networks barron find classes density functions reasonable size works exponentially large function input dimension achieve suitable approximation estimation error parametric family probability density functions parameterized class density functions mixture representation form density functions probability measure main theme paper give approximation estimation bounds arbitrary densities finite mixture densities focus attention densities barron inside give approximation error bound finite mixtures approximation error measured kullbackleibler divergence densities defined density estimation natural distance function fitting literature invariant scale transformations transformation variables intrinsic connection maximum likelihood methods mixture density estimation result quantifies approximation error theorem exists mixture bound characterizes upper bound ratio densities parameters restricted variable note rate 
convergence related dimensions behavior constants depends choices target gaussian location family restrict cube likewise restrict parameters cube case linear dimension depends target density suppose finite mixture components equality components disjoint suppose deal similar setting kullbackleibler proximation bound order onedimensional mixtures gaussians general case necessarily competitive optimality result density approximation good mixture density estimation theorem obtain bound theory information projection shows exists sequence converges function achieves note necessarily element developed building work bell consequence theorem smallest limit sequences achieving approaches prove theorem induction section appealing feature approach iterative estimation procedure estimate component time greedy procedure shown perform procedures computational task estimating component considerably easier estimating full mixtures section iterative construction suitable approximation section shows mixtures estimated data risk bounds stated section iterative construction approximation provide iterative construction fashion suppose discussion approximation seek mixture close initialize choosing single component minimize suppose chosen minimize generally sequence mixtures prove sequences achieve error bounds theorem theorem familiar iterative hilbert space approximation results bartlett follow similar strategy distance measures density approximation involves norms component densities exponentially large dimension naive taylor expansion kullbackleibler divergence leads norm approxi mation weighted reciprocal density difficulty remains challenge adapt iterative approximation kullbackleibler divergence manner permits constant bound involve logarithm density ratio ratio manageable constants barron proof establishes inductive relationship bounded choosing easy induction establish quadratic upper bound analytic inequalities logarithm note application ratio densities obtain upper bound 
involving find defined theorem case taking expectation respect sides acquire quadratic upper bound noting note function greedy algorithm chooses minimize upper bound apply negative part logarithm proof inequality monotone decreasing inequalities shown separately cases limit inequalities multiplies takes derivatives obtain suitable monotonicity moves apply inequality arbitrary density note case side expand square density estimation chosen satisfy shown desired inductive relationship case mixture representation form convex hull analysis yields desired completes proof theorems greedy estimation procedure connection divergence helps motivate estimation procedure data sampled iterative construction turned sequential maximum likeli hood estimation changing step surprising result resulting estimator likelihood high likelihood achieved density difference order formally state empirical distribution proof result proof section expectation respect respect density lets computation step benefits greedy procedure bring chosen maximize simple component mixture problem components fixed achieve bound chosen iterative maximum likelihood held fixed step equal replace component mixture successive mixtures resulting estimate guaranteed high likelihood achieved mixture density barron disadvantage greedy procedure number steps adequately poor initial choices step tune weights convex combinations previous components adjust locations components case result previous iterations components provide natural initialization search step good news long chosen component mixtures achieve likelihood large choice achieving require conclusion follow likelihood results risk bound results apply case global likelihood mixtures case result greedy procedure risk bounds iterative metric entropy family controlled obtain risk bound determine coordinates parameter space allowed represented specifically lipschitz condition assumed coordinate parameter vector note condition satisfied gaussian family restricted cube 
location parameter prescribed cube variance state bound risk theorem assume condition assume cube side length likelihood mixtures generally sequence density estimates satisfying bound risk choice order roughly leading bound order logarithmic factors bound occurs unknown importantly chosen optimize upper bound risk balance approximation estimation sources error occur optimize penalized likelihood criterion related minimum description length principle barron cover function satisfies mixture density estimation penalized procedure picks minimizing proof risk bounds builds general results maximum likelihood penalized maximum likelihood procedures recently established randomized algorithm estimating tures gaussians case data drawn finite mixture separated gaussian components common covariance runs time linear dimension quadratic sample size present forms algorithm require large sample sizes accurate estimates density techniques work general mixtures iterative likelihood maximization relationship accuracy sample size number components references barron andrew universal approximation bounds superpositions sigmoidal function ieee transactions information theory barron andrew approximation estimation bounds artificial neural networks machine learning chris rates convergence gaussian mixture manuscript jonathan estimation mixture models dissertation department statistics yale university bell robert cover thomas optimal science jones simple lemma greedy approximation hilbert space convergence rates projection pursuit regression neural network training annals statistics bartlett williamson efficient agnostic learn neural networks bounded fanin ieee transactions information theory meir density estimation convex densities approximation estimation bounds neural networks jonathan iterative estimation mixture models department statistics yale university barron andrew cover thomas minimum complexity density estimation ieee transactions information theory learning mixtures gaussians 
proc ieee conf foundations computer science
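The greedy estimation procedure from the mixture density paper above (start with the single best component, then at each step keep the previous mixture fixed and choose one new component and its mixing weight to maximize the likelihood) can be sketched as follows. This is a hedged illustration under stated assumptions: the family is one-dimensional Gaussians with a fixed, known standard deviation, the search over locations and mixing weights is a plain grid search, and the names `greedy_mixture`, `log_lik` and `gauss` are hypothetical.

```python
import math
from typing import List, Optional, Sequence, Tuple

Component = Tuple[float, float, float]  # (weight, mean, standard deviation)


def gauss(x: float, mu: float, sigma: float) -> float:
    """Density of a one-dimensional Gaussian."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def log_lik(data: Sequence[float], comps: Sequence[Component]) -> float:
    """Log-likelihood of the data under a finite mixture."""
    return sum(
        math.log(sum(w * gauss(x, mu, s) for w, mu, s in comps)) for x in data
    )


def greedy_mixture(
    data: Sequence[float],
    k: int,
    mus: Sequence[float],            # assumed candidate grid of locations
    sigma: float = 1.0,              # assumed fixed component width
    alphas: Optional[Sequence[float]] = None,  # assumed grid of new-component weights
) -> List[Component]:
    """Build a k-component mixture one component at a time."""
    if alphas is None:
        alphas = [i / 20.0 for i in range(1, 20)]
    # Step 1: the single component with the highest likelihood.
    first = max(mus, key=lambda m: log_lik(data, [(1.0, m, sigma)]))
    comps: List[Component] = [(1.0, first, sigma)]
    # Steps 2..k: f_new = (1 - a) * f_old + a * phi, with f_old held fixed.
    for _ in range(k - 1):
        best_ll = -math.inf
        best = comps
        for a in alphas:
            shrunk = [((1.0 - a) * w, mu, s) for w, mu, s in comps]
            for m in mus:
                cand = shrunk + [(a, m, sigma)]
                ll = log_lik(data, cand)
                if ll > best_ll:
                    best_ll, best = ll, cand
        comps = best
    return comps
```

Each step only solves a two-parameter search (one location, one weight), which is the computational advantage the text attributes to the greedy procedure over fitting all components of the mixture at once.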
2 predicting weather genetic memory predicting weather genetic memory combi nation kanervas sparse distributed memory genetic algorithms david rogers research institute advanced computer science nasa ames research center field abstract kanervas sparse distributed memory model based mathematical properties highdimensional binary address spaces genetic algorithms search tech nique highdimensional spaces inspired evolutionary processes genetic memory hybrid systems memory genetic algorithm dynamically recon figure physical storage locations reflect correlations stored addresses data presented weather station data genetic memory discovers specific tures weather data correlate rain memory utilize information effectively architecture designed maximize ability system handle realworld problems introduction future success neural networks depends ability small networks lowdimensional problems networks thousands nodes highdimensional realworld problems dimensionality problem refers number variables needed describe problem domain neural networks shown scalable realworld problems remain restricted specialized applications adds types computational demands system linear increase computational demand proportional increased number vari ables greater nonlinear increase computational demand rogers number interactions occur variables effect primarily responsible difficulties encountered systems general difficult system specifically designed function highdimensional domains systems designed function highdimensional domains kanervas sparse distributed memory kanerva genetic algorithms holland hypothesized hybrid systems preserve ability operate highdimensional environments offer func individually call hybrid genetic memory test capabilities applied problem forecasting rain local weather data kanervas sparse distributed memory model based mathematical properties highdimensional binary address spaces represented threelayer neuralnetwork extremely large number nodes middle layer standard 
formulation connec tions input layer hidden layer input representation system learning changing values connections hidden layer output layer genetic algorithms search technique highdimensional spaces evolutionary processes members binary strings opportunity selecting successful members population parents string created pieces parent finally string string removed genetic memory hybrid systems hybrid genetic gorithm connections input layer layer connections hidden layer output layer changed standard method sparse distributed memory success input representation determined reflects correlations tween addresses data previously presented work statistical predic tion rogers separate learning algorithms memory genetic algorithm dynamically input representation reflect correlations collections input variables stored data applied genetic memory architecture problem predicting rain local weather features pressure cover month temperature weather data contained features sampled period australian coded state stored address single denoted hours weather state allowed genetic algorithm memory scanned file weather states success procedure measured ways training completed genetic memory predicting rain stan dard sparse distributed memory access input representations discovered genetic memory view specific combinations tures predicted rain unlike neural networks genetic memory user internal representations discovers training predicting weather genetic memory reference address location addresses radius input data data counters output data figure structure sparse distributed memory sparse distributed memory sparse distributed memory illustrated variant structure addresses data shown figure memory location figure location addresses random addresses data counters initialized operations begin memory entails hamming distance refer ence address location addresses distance equal hamming radius entry location termed selected ensemble selected locations called selected selection noted figure 
rows radius chosen small percentage memory locations selected reference address writing memory selected counters elements input equal incremented selected counters elements input data equal completes write operation reading memory selected data counters summed sums greater equal corre sponding output data output data reading contents input data rogers makes clear datum distributed data counters locations writing datum reconstructed reading averaging sums counters depending additional data written selected locations depending data correlate original data reconstruction noise model fullyconnected threelayer feedforward neural network model location addresses weights input layer hidden units data counters weights hidden units output layer note number hiddenlayer nodes possibly larger commonly artificial neural networks unclear standard algorithms back propagation perform large number units hidden layer genetic algorithms genetic algorithms search technique highdimensional spaces inspired evolutionary processes domain genetic algorithm tion binary strings fitness function method fitness members fitness function select members member placement selection absolutely worst members selected members chosen proportional fitness scores member selected removed population members good create member place population effect genetic algorithm search highdimensional space strings fitness function process create members population called crossover crossover align good candidates segment create string starting transcription bits parent strings switching transcription parent string population taking place member parent parent member figure crossover binary strings running genetic algorithm population times population evolves members rated fitness function predicting weather genetic memory layer unit layer weights changed perceptton rule weights changed genetic algorithm figure structure genetic memory holland crossover procedure extremely efficient method searching highdimensional 
space genetic memory genetic memory hybrid kanervas sparse distributed memory genetic algorithms hybrid location addresses held constant genetic algorithm move advantageous positions address space view neural hybrid algorithm change weights connections input layer hidden unit layer connections hidden unit layer output layer changed standard method work combined neural networks genetic algorithms networks genetic algorithm successful networks create entire networks genetic memory single network algorithms changing weights layers genetic memory incorporates genetic algorithm directly operation single network australian weather data weather data collected single site australian sample hours years weather samples file contained distinct features including year month month time pressure bulb temperature bulb temperature point wind speed wind direction cover past hours work coded weather sample word weather coded binary address giving feature field address feature values coarsecoded simple code figure shows code month procedure weather prediction standard model locations addresses held constant genetic memory location addresses genetic algorithm rogers fitness function based work statistical prediction presented rogers work assigns number physical storage tion figure measure location locations crossover tion address location data counter measure correlation selection location occurrence data counters judge ness memory location train memory present memory weather state memory shown data multiple number times state addressed address represents written memory rain hours number weather samples genetic algorithm formed replace location address created predictive addresses procedure continued memory weather samples performed genetic analysis results initial results genetic memory procedure conducted memory storage locations weather sample consisted sequence weather samples hours period years sample hours samples hours samples genetic memory testing storing weather samples 
samples memory order storage memory genetic genetic memo sparse distributed memory tested previously unseen weather samples initial experiments genetic memory fewer errors sparse distributed memory genetic memory show improvement performance user analyze memory locations memory improved performance studying memory locations genetic memory open black access parameters memory decided tive associating sample addresses sample data ability access parameters system found effective important implications predicting weather genetic memory parameters offer insights underlying physical processes study knowledge system predicts robustness envelope applicability memory prior embedding realworld system simply scoring performance system open black study system performs opening black training completed analyze structure memory locations performed discover features found values features preferred memory location rated predicting rain training measuring distance field values field discover values feature desired closer hamming distance absolute range values sensitivity cation feature dimension figure shows analysis field month memory location locations field values months distance desirable desirable feature month values figure analyzing location field case location finds january february desirable months rain july august desirable months relative sensitivity features measures features important making prediction case change tance bits makes location sensitive month estimate features important predicting rain relative sensitivity fields location feature graphs show sensitive features previously shown memory location predicting rain location tive combination fields proper values rogers cover high bulb pressure month figure sensitive features preferred values fields minima graphs exam location greatly prefers january february june july pref location month january february high cover temperature surprisingly hours important features location sensitive features graphs show sensitive 
features memory location predicting rain location insensitive values features year wind direction bulb figure sensitive features fields expect unimportant year fields wind direction unimportant location inter locations find regions weather space predicting weather genetic memory comparison davis method davis algorithm shown powerful method power system attempt contrast approaches importance work reader referred book detailed information approach difficult directly compare performance techniques nature experiments genetic memory sible compare architectural features systems estimate relative strengths weaknesses backpropagation associative memories davis approach relies formance backpropagation algorithm central learning cycle associative memories learning cycle backpropagation networks shown characteristics training domains system based associative memory share advantages system based backpropagation issues backpropagation networks remain simple build backpropagation networks thousands hundreds thousands units contrast kanervas sparse distributed memo specifically designed massive construction implementation connection machine hidden units genetic memory shares property unity davis algorithm levels processing level consists standard backpropagation networks networks genetic memory incorporated algorithms single network algorithms operating simultaneously intuition algorithms suited layers neural network layers large fanout input layer hidden units driven algorithm suited highdimensional searching genetic algorithms selforganizing system layers large fanin layer output layer driven hillclimbing algorithms backpropagation conclusions realworld problems highdimensional large numbers dependent variables algorithms specifically designed func tion highdimensional spaces genetic memory algorithm genetic memory sharing features davis approach differences make problems easier scale node systems incorporation genetic algorithm improves recall performance standard associative 
memory structure genetic memory user access covered genetic algorithm assist making associations stored memory rogers acknowledgments work supported part cooperative national aeronautics space administration nasa space research association funding related connection machine jointly provided nasa defense advanced research projects agency darpa involved helpful work grateful entire group work olshausen kanerva payoff clear finally wait references davis genetic algorithms simulated annealing london england publishing holland adaptation natural artificial systems arbor michigan press holland possibilities learning algorithms applied parallel rulebased systems machine learning artificial intelligence approach volume mitchell california morgan kaufmann kanerva search unified theory memory center study language information report kanerva sparse distributed memory cambridge mass press rogers david improve performance kanervas sparse distributed memory research institute advanced computer science technical report nasa ames research center rogers david kanervas sparse distributed memory associative memory gorithm wellsuited connection machine highspeed rogers david statistical prediction kanervas sparse distributed memory advances neural information processing systems mateo morgan kaufman
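The read and write operations of the sparse distributed memory described in the genetic memory paper above (select every location whose address lies within a Hamming radius of the reference address, increment or decrement the selected data counters on a write, and sum and threshold them on a read) can be sketched as follows. This is a minimal sketch, not the author's code: the class and method names, the sizes, and the seed are assumptions, and the `crossover` helper uses a single switch point as a simplification of the crossover the text describes for creating new location addresses.

```python
import random
from typing import List


class SparseDistributedMemory:
    """Toy sparse distributed memory with random binary location addresses."""

    def __init__(self, n_locations: int, addr_bits: int, data_bits: int,
                 radius: int, seed: int = 0) -> None:
        rng = random.Random(seed)
        self.addresses = [
            [rng.randint(0, 1) for _ in range(addr_bits)]
            for _ in range(n_locations)
        ]
        # One row of data counters per location, initialized to zero.
        self.counters = [[0] * data_bits for _ in range(n_locations)]
        self.data_bits = data_bits
        self.radius = radius

    def _selected(self, address: List[int]) -> List[int]:
        """Indices of locations within the Hamming radius of the address."""
        return [
            i for i, loc in enumerate(self.addresses)
            if sum(a != b for a, b in zip(loc, address)) <= self.radius
        ]

    def write(self, address: List[int], data: List[int]) -> None:
        """Increment counters where the data bit is 1, decrement where it is 0."""
        for i in self._selected(address):
            row = self.counters[i]
            for j, bit in enumerate(data):
                row[j] += 1 if bit else -1

    def read(self, address: List[int]) -> List[int]:
        """Sum the selected counters column-wise and threshold at zero."""
        sel = self._selected(address)
        sums = [sum(self.counters[i][j] for i in sel)
                for j in range(self.data_bits)]
        return [1 if s >= 0 else 0 for s in sums]


def crossover(parent_a: List[int], parent_b: List[int],
              rng: random.Random) -> List[int]:
    """One-point crossover (a simplification): copy bits from parent_a,
    then switch to parent_b at a random position."""
    point = rng.randrange(1, len(parent_a))
    return parent_a[:point] + parent_b[point:]
```

In a genetic memory along the lines described above, a location address produced by `crossover` from two highly rated locations would replace the address of a poorly rated one; how locations are rated is left out of this sketch.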
7 simplifying neural nets discovering flat minima hochreiter schmidhuber informatik technische germany abstract present algorithm finding complexity networks high generalization capability algorithm searches large connected regions socalled flat minima error function weightspace environment flat minimum error remains approximately constant flat minima shown correspond expected overfitting algorithm requires computation order derivatives order complexity experiments feedforward recurrent nets application stock market prediction method outperforms conventional backprop weight decay optimal brain surgeon introduction previous algorithms finding complexity networks high generalization capability based significant prior assumptions broadly classified assumptions prior weight distribution hinton camp williams assume posterior distribution learning close prior leads good generalization weight decay derived gaussian priors nowlan hinton assume networks similar weights generated gaussian mixtures priori mackays priors implicit additional penalty terms embody hochreiter schmidhuber assumptions made prior assumptions theoretical results early stopping network complexity carry applications examples methods based validation sets structural risk minimization methods wang approach requires prior assumptions approaches appendix basic idea flat minima search algorithm finds large region weight space property weight vector region similar small error regions called minima intuitive flat minima interesting wolpert sharp minimum corresponds weights high precision flat minimum corresponds weights precision terminology theory minimum description length fewer bits information required pick flat minimum simple principle suggests network complex corresponds high generalization performance unlike hinton method appendix approach depend explicitly choosing good prior algorithm finds flat minima searching weights minimize training error weight precision requires computation hessian efficient 
order method obtain order complexity conventional backprop ically effectively reduces numbers units lines sensitivity outputs respect remaining weights units excellent experimental generalization results reported section task architecture boxes generalization task task approximate unknown relation inputs outputs function relation obtained adding noise outputs training information finite relation called training element denoted pair architecture simplicity focus standard feedforward experiments recurrent nets input units output units weights activation functions maps input vectors output vectors weight unit denoted weight vector denoted error squared error denotes euclidian norm denotes cardinality define regions weight space property weight vector region similar small error introduce error positive constant small error defined smaller implies fitting boxes weight satisfying defines acceptable mini interested large regions connected acceptable simplifying neural nets discovering flat minima regions called flat minima generalization error simplify algorithm finding large connected regions maximal connected regions focus socalled regions acceptable minimum weight space center simplicity edge parallel weight axis half length edge direction axis weight denoted awij maximal positive positive added subtracted component ously awij precision volume defined awij algorithm algorithm designed find defining maximal equivalent finding minimal note relationship number bits required describe weights appendix derive algorithm minimizes activation output unit constant positive variable ensuring ensuring expected decrease learning adjusting minimized gradient descent minimize compute shown efficient order method gradient computed time details algorithm order complexity standard backprop experimental results details experiment noisy classification experiment pearlmutter task decide point space exceeds class class noisy training examples generated data points obtained gaus sian bounded 
interval data points misclassified probability final input data obtained adding gaussian data points test data points found procedure leads cent hochreiter schmidhuber backprop approach backprop approach table comparisons conventional backprop method labeled shows squared error test shows difference fraction cent tions optimal fraction remaining rows provide information approach outperforms backprop misclassified data method cent inherent noise data training based fixed data points test based data points results conventional backprop nets tested equally networks based method fiat minima search epochs weights nets essentially stopped changing automatic early stopping backprop changing weights learn outliers data overfit approach left single hidden unit maximal weight xaxis input unlike backprop hidden units effectively pruned outputs yaxis input weight shown corresponds optimal minimal numbers units weights table illustrates superior performance approach experiment recurrent nets method works continually running fully recurrent nets time step recurrent sigmoid activations sees input vector stream randomly chosen input vectors task switch output unit input occurred time steps switch output unit delay response input task solved single hidden unit results conventional recurrent algorithms training hidden units store input vector approach trained networks learned perfect solutions weight decay weights output decayed unlike weight decay strong inhibitory connections switched hidden units effectively pruning experiment stock market prediction predict german stock market index based fundamental experiments technical experiment indicators strictly layered feedforward nets sigmoid units active performance measures confidence output positive tendency negative tendency performance incorrectly predicted subtracted simplifying neural nets discovering flat minima correctly predicted result divided absolute experiment fundamental inputs german interest rate industrial production 
divided supply business training examples test examples prediction confidence architecture experiment fundamental inputs rate foreign orders industry training examples test examples monthly prediction confidence architecture experiment technical inputs recent change relative strength index difference week statistic difference exponentially weighted week week training examples test examples predictions confidence architecture methods tested conventional backprop optimal brain surgeon weight decay fiat results method outperforms methods cent details appendix theoretical justification overfitting error analogy decompose generalization error overfit ting error fitting error significant fitting error empirical risk thought required define overfitting error relation optimal posterior weight distribution obtain training thing theoretical purposes suppose initialize weights learning training kullbackleibler distance measure information noise conveyed conjunction initialization conceptual setting defining overfitting error measure initialization matter heavily influence posterior overfitting error kullbackleibler distance posteriors expectation expected difference minimal description respect learning measure expected overfitting error relative section computing expectation range posterior scaled obtain distribution error respect hochreiter schmidhuber pick minimized pose additional prior assumptions implicit previous approaches make additional stronger assumptions section assumption minimum close maximum formal definition intuitively ensures training reduce error assumption peaks maxima sharp assumption holds weights model ensures regions error network output find nets fiat outputs conditions defined section condition ensures condition forces equal weight space directions cases linear made justified weights causing error perturbed causing significant perturbing weights components obtain expresses dependence suppressed convenience linear approximation justified condition 
defines output small linear approximation gradient section satisfy condition select flat condition degrees freedom left condition enforces directed errors obvious meaning volume minimize expected length center condition influences algorithm algorithm prefers weights important target output algorithm enforces equal sensitivity output units respect weights algorithm group hidden units relevance groups output units condition condition corresponds order derivative reduction ordinary sensitivity reduction linear justified choice equation solve equation simplifying neural nets discovering flat fixing insert equation replacing equation depend suppressed approximate section section approximated immediately leads algorithm equation approximation justified learning process enforces validity justification initially conditions valid small environment initial acceptable minimum search acceptable minima volume environments implies absolute values entries hessian decrease shown algorithm suppress values unit activations order activation derivatives contributions unit activation output weights inputs activation functions order derivatives bounded shown entries hessian decrease increase relation hinton camp hinton camp minimize terms conventional error variance distance posterior prior problem choose good prior contrast approach approach require good prior advance hinton camp compute variances weights units general linear approximation intuitively speaking weight variances related awij approach justify linear approximation references guyon vapnik boser bottou solla structural risk tion character recognition moody hanson editors advances neural information processing systems pages mateo morgan kaufmann hassibi stork order derivatives network pruning optimal brain surgeon cowan giles editors advances neural information processing systems pages mateo morgan kaufmann hinton camp keeping neural networks simple proceedings international conference artificial neural networks amsterdam 
pages springer hochreiter schmidhuber hochreiter schmidhuber flat minima search discovering simple nets technical report informatik technische theory generalization linearly weighted connectionist networks thesis cambridge university engineering department mackay practical bayesian framework backprop networks neural computation exact calculation product hessian matrix feedforward network error functions vector time technical report puter science department university denmark moody utans architecture selection strategies neural networks application corporate bond rating prediction editor neural networks capital markets john wiley sons murray edwards synaptic weight noise learning enhances faulttolerance generalisation learning trajectory cowan hanson giles editors advances neural information processing systems pages mateo morgan kaufmann nowlan hinton simplifying neural networks soft weight sharing neural computation pearlmutter fast exact multiplication hessian neural computation pearlmutter complexity general ization neural networks lippmann moody touretzky editors advances neural information processing systems pages mateo morgan kaufmann schmidhuber discovering problem solutions complex high generalization capability technical report informatik technische vapnik principles risk minimization learning theory moody hanson lippman editors advances neural information processing systems pages mateo morgan kaufmann wang venkatesh judd optimal stopping effective machine complexity learning cowan tesauro alspector editors neural information processing systems pages morgan mann mateo weigend rumelhart huberman generalization weight elimination application forecasting lippmann moody touretzky editors advances neural information processing systems pages mateo morgan kaufmann williams bayesian regularisation pruning laplace prior technical report school cognitive computing sciences university wolpert bayesian backpropagation functions weights cowan tesauro alspector editors 
advances neural information processing systems pages mateo morgan kaufmann
4 neural network gaussian mixture hybrid speech recognition density estimation yoshua bengio dept brain cognitive sciences massachusetts institute technology cambridge mori school computer science university canada speech technology center university denmark university computer science germany abstract subject paper integration multilayered artificial networks probability density functions gaussian mixtures found continuous density hidden markov models part paper present hybrid parameters system simultaneously optimized respect single criterion part paper study relationship density inputs network density outputs networks experiments presented explore perform density estimation anns introduction paper studies integration artificial neural networks prob ability density functions gaussian mixtures contin uous density hidden markov models anns considered multilayered recurrent networks hyperbolic tangent hidden units preprocessed data outputs observations parametric probability density function gaussian mixture view adaptive preprocessor gaussian mixture gaussian mixture statistical role transform input data efficiently gaussian mixture interesting situation input data points lower dimensional space case desired learns possibly nonlinear transformation compact representation bengio mori part paper briefly describe hybrid anns markov models continuous speech recognition details system found bengio hybrid free parameters simultaneously optimized respect single criterion recent years related combinations studied levin bridle bourlard wellekens approaches motivated observed advantages disadvantages anns hmms speech recognition bourlard wellekens bridle experiments phoneme recognition timit database proposed hybrid reported task study recogni tion spotting sounds continuous speech comparative results task show hybrid performs dynamic programming based duration constraints global optimization parameters system yielded performance separate optimization part paper attempt 
extend findings part order basic architecture anns gaussian mixtures perform density estimation establish relationship network input output densities describe experiments exploring perform density estimation system hybrid likelihood observations model depends continuous observations compute derivative optimization criterion respect observations criterion maximum likelihood observations maximum mutual information observations correct sequence observation instant vector output gradient optimize parameters backpropagation bridle bottou bengio bengio ways compute gradient experiments preliminary experiment performed prototype system based integration anns hmms initially trained based prior task decomposition task recognition phonemes large speaker population version timit continuous speech database purpose sentences regions training sentences test sentences train speakers test speakers classes considered phones speakerindependent recognition phonemes continuous speech difficult task phonemes made short nonstationary events similar consonants merged unit segments recognition system neural mixture hybrid speech recognition density estimation level level level initially initially specialized principal components networks lower broad phonetic classes levels speech preprocessing task figure architecture hybrid experiments anns trained backpropagation online weight update bengio speech knowledge design input output architecture system networks experimental based scheme shown figure architecture built levels approach select input parameters architectures depending phonetic features recognized level anns initially trained perform recognition broad classification delays recur rent connections trained recognize static articulatory features depends place articulation context phoneme delays recurrent connections design details bengio level acts integrator parameters generated specialized anns level linear network initially computes principal components concatenated output tors 
lower level networks experiment combined network weights level hmms distribution modeled gaussian mixture densities bengio details topology covariance matrix assumed diagonal observations initially principal components assumption reduces significantly parameters estimated iteration reestimation parameters parameters hybrid system ously tuned maximize criterion iterations simplicity implementation hybrid trained criterion experiments optimization theoretically performance observed marked improvement performance final global tuning explained fact nearby local maximum section maximization likelihood inputs network bengio mori likelihood attained initial starting point based prior separate training table comparative recognition results recognized deletions accuracy deletions insertions anns hmms order assess proposed approach improvements brought time alignment performance hybrid system evaluated compared simple post processor applied outputs anns standard dynamic programming models duration probabilities phoneme simple assigns symbol output frame anns comparing target output vectors actual output vectors resulting string remove short segments consecutive segments symbol dynamic programming finds sequence phones minimizes cost imposes constraints phoneme system observations energy signal derivatives comparative results systems summarized table density estimation section extension system previous section objective perform density estimation inputs maximizing criterion depends density outputs maximize likelihood inputs preprocessor gaussian mixtures part probability density function estimated representing spatially local functions kernels gaussians silverman explore global transformation performed order represent define notation relation input output theorem suppose random variable outputs deter parametric function random variable inputs vectors dimension outputs network neural mixture hybrid speech recognition density estimation jacobian transformation assume singular 
decomposition product values suppose modeled probability density function proof case change variable integral obtain case network outputs inputs order introduce intermediate transformation space dimension dimensions directly correspond define decompose onetoone mapping jacobian matrix composed columns perform change variables integral equation order make change variable variable conditional write multiplying integrals equations obtain substituting yields general result equation clear efficiently evaluate derivative respect network weights experiments section study empirically simpler case case equivalent knowing bengio mori figure series experiments density estimation data generated nonlinear input curve left input samples density input estimated maps density output estimated gaussian estimation parameters estimating approximate functions parameterized functions output class densities modeled gaussian mixture number ponents means variances mixing proportions nonlinear transformation choose defined architecture values weights order choose values gaus sian parameters maximize probability parameters data prior assumed maximize likelihood input data parameters preliminary exper logarithm likelihood data maximized optimal parameters defined argmax inputs samples order estimate density system computes derivative respect output gaussian mixture reestimate parameters algorithm depends expression equations differentiating equation respect yields derivative logarithm determinant computed simply bottou neural mixture hybrid speech recognition density estimation figure experiments density estimation left input samples density nonlinear gaussian output samples network transformation experiments series experiments verified transformation inputs improve likelihood inputs gradient ascent criterion find good solution experiments attempt model twodimensional data extracted speech database training data points shown left figure experiment diagonal gaussian experiment linear network 
diagonal gaussian experiment nonlinear network hidden units diagonal gaussian average likelihoods obtained test points experiments estimated input output pdfs experiment depicted figure white indicating high density black density series experiments addresses question gaussian mixture diagonal covariance matrix data linear hypersurface dimension anns outputs separate dimensions data varies greatly doesnt orthogonal intuitively appears case variance outputs dont vary data close determinant jacobian nonzero likelihood correspondingly tend infinity experiment series verified case linear networks data generated diagonal line space resulting network separated variant dimension invariant dimension output dimensions variance transformed data lying line parallel output dimension experiments nonlinear networks suggest networks solution separates variant dimensions invariant easily found gradient ascent show solution maximum possibly local likelihood experiment designed demonstrate input data shown figure artificially generated make solution network inputs hidden units bengio mori outputs input samples input density weights maximum likelihood displayed figure transformed input data weights points projected line parallel output dimension variation weights solution direction gradient learning rate small yielded improvement decrease likelihood conclusion paper studied architecture performs nonlinear transformation data analyzed output modeled gaussian mixture design incorporate prior knowledge problem task perform initial training subnetworks phoneme recognition experiments hybrid based architecture performed part paper shown input network relates outputs network objective work perform density estimation nonlocal nonlinear transformation data preliminary experiments showed estimation improve likelihood resulting respect gaussian studied system perform nonlinear analogue principal components analysis references bengio artificial neural networks application sequence recognition 
thesis school computer science university montreal canada bengio mori acoustic parameters continuous speech recognition artificial neural networks speech bottou applications doctoral thesis paris france bourlard wellekens speech pattern discrimination multilayer perceptrons speech language bridle training stochastic model recognition algorithms networks lead maximum mutual information estimation parameters advances neural information processing systems touretzky morgan kaufmann levin word recognition hidden control neural architecture proceedings international conference acoustics speech signal processing albuquerque april silverman density estimation statistics data analysis hall york
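The density estimation section of the paper above relates the density of the network inputs to the density of the network outputs through the Jacobian of the transformation. A minimal sketch of the simplest case it studies (as many outputs as inputs, one-to-one mapping) follows. It assumes a linear 2x2 "network" y = W x and a single diagonal Gaussian on the outputs; the names `diag_gaussian_logpdf` and `input_log_density` are hypothetical and do not come from the paper.

```python
import math


def diag_gaussian_logpdf(y, mean, var):
    """Log density of a Gaussian with diagonal covariance, evaluated at y."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (yi - m) ** 2 / v)
        for yi, m, v in zip(y, mean, var)
    )


def input_log_density(x, W, mean, var):
    """Log density of the input x under the change of variables y = W x.

    log p_X(x) = log p_Y(W x) + log |det W|
    Assumption: W is a 2x2 nonsingular matrix (a linear network, no hidden units).
    """
    y = [
        W[0][0] * x[0] + W[0][1] * x[1],
        W[1][0] * x[0] + W[1][1] * x[1],
    ]
    det = W[0][0] * W[1][1] - W[0][1] * W[1][0]
    if det == 0.0:
        raise ValueError("singular Jacobian: the change of variables is undefined")
    return diag_gaussian_logpdf(y, mean, var) + math.log(abs(det))


identity = [[1.0, 0.0], [0.0, 1.0]]
stretch = [[2.0, 0.0], [0.0, 2.0]]
base = input_log_density([0.0, 0.0], identity, [0.0, 0.0], [1.0, 1.0])
scaled = input_log_density([0.0, 0.0], stretch, [0.0, 0.0], [1.0, 1.0])
print(base, scaled)
```

The log-determinant term is what prevents the likelihood from being raised simply by shrinking the outputs toward the Gaussian mean. With a nonlinear network, the Jacobian would have to be evaluated at every input sample.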
10 nonparametric multiscale statistical model natural images bonet paul viola artificial intelligence laboratory learning vision group technology square massachusetts institute technology cambridge email abstract observed distribution natural images uniform contrary real images complex important struc ture exploited image processing recognition analysis proposed approaches statistical modeling images limited complexity models complexity ages present nonparametric multiscale statistical model images recognition image denoising generatire mode synthesize high quality textures introduction paper describe multiscale statistical model capture structure natural images scales trained images recognize images generate images tasks efficient requiring seconds minutes workstation statistical modeling images reaches back duda hart statistical approaches provide unified view learning classification generation date generic efficient unified statistical model natural images approaches shown significant competence specific areas statistical model generic images markov random field geman geman mrfs define distribution bonet viola images based simple local interactions pixels mrfs successfully restoration images generatire prop erties weak inability mrfs capture long range frequency interactions pixels recently great deal interest hierarchical models helmholtz machine hinton dayan helmholtz machine trained discover long range structure easily applied natural images multiscale wavelet models emerged effective technique modeling realistic natural images techniques hypothesize wavelet transform measures underlying natural images assumed cally independent primary evidence conjecture coefficients wavelet transformed images uncorrelated entropy success wavelet compression insights noise reduction donoho simoncelli adelson driven texture synthesis heeger bergen main drawback wavelet algorithms assumption complete independence coefficients conjecture fact strong dependence wavelet 
coefficients image consistent observations bonet simoncelli multiscale statistical models multiscale wavelet techniques assume images linear transform tion statistically independent random variables image inverse wavelet transform vector random variable assumed independent distribution independent wavelet transforms developed share type multiscale structure wavelet matrix spatially localized filter shifted scaled version single basis function wavelet transform efficiently computed iterative convolution bank filters pyramid frequency images created image factor dimension convolution operation pass filter level series filter functions applied types filters computation linear transformation thought single matrix careful selection matrix constructed simoncelli convenient combine pixels feature images single vector expected distribution function image classes modeled attempt model space natural images case appears accurate highly rare cases large values donoho simoncelli adelson direct contrast distribution images gaussian difference distributions basis noise reduction algorithms reducing wavelet coefficients inverse wavelet transform similar putation forward wavelet transform nonparametric multiscale statistical image model noise signal specific image classes modeled similar methods heeger bergen input images empirical distribution observed generate texture sampled assumed independent empirical distributions generated images computed inverse wavelet transform bergen heeger approach build probabilistic model texture single image assume textures spatially ergodic expected distribution function position image result pixels feature image samples distribution combined heeger work current state texture generation figure textures notice technique generating smooth textures defined structure image structures sharp edges border rightmost texture modeled approach image features directly assumption wavelet coefficients image independent types natural images coefficients wavelet 
transform independent images long edges wavelets local frequency space long edge local frequency space result wavelet representation feature requires coefficients high frequencies edge captured small high frequency wavelets long scale captured number larger frequency wavelets model assumes coefficients independent accurately model images nonlocal features conversely model captures conditional dependencies coefficients effective chose approximate joint distribution coefficients chain coefficients occur higher wavelet pyramid condition distribution coefficients lower levels frequencies condition generation higher frequencies pixel image define parent vector pixel level pyramid number features generating coefficients independently define chain scale chain generation lower levels depend higher levels mumford related formal model generation process slightly complex involving iteration designed match pixel histogram implementation generating images figure incorporates discuss bonet viola figure synthesis results heeger bergen model input textures bottom synthesis results technique generating fine noisy textures generating textures require cooccurrence wavelets multiple scales figure synthesis results technique input textures shown figure subset elements computed assume ergodicity independent process starts pyramid choosing values points generated values level generated process continues wavelet coefficients generated finally image computed inverse wavelet transform important note probabilistic model made collection independent chains parent vectors neighboring pixels substantial overlap coefficients higher pyramid levels nonparametric multiscale statistical image model lower resolution shared neighboring pixels lower pyramid levels generation nearby pixels strongly dependent related approach similar arrangement generafive chains termed markov tree estimating conditional distributions additional descriptive power generafive model cost conditional distributions estimated 
observations choose directly data nonparametric fashion sample parent vectors image estimate conditional distribution ratio parzen window density subset parent vector information level level function vectors returns maximal values vectors similar smaller values vectors similar explored functions results presented function returns fixed constant coefficients tors threshold simple tion sampling straightforward find pick experiments applied approach problems texture generation texture recog nition target recognition signal denoising case results published approaches figure show results technique textures figure textures model features caused conjunction wavelets striking rightmost texture geometrical preserved model knowledge joint distribution constraints critical perceived appearance synthesized texture model measure similarity image measuring likelihood generating parent vectors image chain model image easy data sets texture test suite performance slightly higher techniques approach achieved correct classification compared achieved gaussian approach lattice test suite slightly difficult texture composition textures spatial frequencies approach achieved alternate method case gabor convolution energy method achieved gaussian mrfs explicitly assume texture unimodal distribution result achieve correct recognition measured performance types natural texture compared classification power model human observers humans discriminate textures extremely accurately bonet original shrinkage shrinkage residual residual figure original original image image corrupted white gaussian noise shrinkage results denoising wavelet shrinkage donoho adelson shrinkage residual residual error shrinkage result original notice error great deal interpretable structure denoising approach residual residual error errors structured test humans achieved accuracy approach achieved accuracy achieved strong probabilistic model images perform variety image processing tasks including denoising denoising 
observed image performed monte carlo averaging draw number sample images prior density compute likelihood noise image find weighted average images weighted average estimated ways image generated observation image denoising frequently relies generic image models simply enforce image smoothness priors leave residual noise remove original image contrast construct probability density model noisy image effect assume image redundant examples visual structures texture approach directly related redundancy image redundancy image parent structures resampled images significant likelihood original image redundancy image arise regular texture smoothly varying patch resampling freely average similar regions effect reducing noise images figure show results denoising approach nonparametric multiscale statistical image model conclusions presented statistical model texture trained exam images form model conditional chain scale wavelet coefficients cross scale distributions estimated important observed conditional distributions complex multiple modes main weaknesses current approach tree distributions defined fixed nonoverlapping conditional distributions estimated small number samples hope address limitations future work acknowledgments research bonet supported search program university research initiative paul viola office naval research grant references golden modeling estimation multiresolution stochastic processes ieee transactions information theory simoncelli wavelet image coding based conditional probability model proceedings munich germany classification textures gaussian markov random fields proceedings international joint conference acoustics speech signal processing volume pages dayan hinton neal zemel helmholtz machine neural computation bonet multiresolution sampling procedure analysis synthesis texture images computer graphics donoho adaptation unknown smoothness wavelet shrinkage technical report stanford university department statistics tech report duda hart pattern 
classification scene analysis john wiley sons gabor filters texture biological cybernetics geman stochastic relaxation gibbs distributions bayesian restoration images ieee transactions pattern analysis machine intelligence heeger bergen texture computer graphics proceedings pages hinton dayan frey neal wakesleep algorithm unsupervised neural networks science simoncelli adelson noise removal bayesian wavelet ieee intl conf image processing switzerland ieee simoncelli freeman adelson heeger multiscale transforms ieee transactions information theory mumford filters random fields maximum unified theory texture modeling intl journal computer vision
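The paper above estimates each cross-scale conditional distribution nonparametrically, and with a thresholded similarity function sampling reduces to finding the training parent vectors that lie within the threshold and picking one of them. A minimal sketch of that sampling step follows. The function names, the per-component threshold test, and the nearest-neighbour fallback for an empty candidate set are assumptions for illustration, not details taken from the paper.

```python
import random


def within_threshold(a, b, threshold):
    """True if every component of a differs from b by at most threshold."""
    return all(abs(ai - bi) <= threshold for ai, bi in zip(a, b))


def sample_child(parent, examples, threshold, rng):
    """Draw a child coefficient conditioned on a parent vector.

    examples is a list of (parent_vector, child_value) pairs observed in the
    training image. All examples whose parent vector is within the threshold
    are treated as equally likely.
    """
    candidates = [
        child for p, child in examples if within_threshold(parent, p, threshold)
    ]
    if not candidates:
        # Assumption: fall back to the closest training parent vector.
        _, child = min(
            examples,
            key=lambda e: sum((x - y) ** 2 for x, y in zip(parent, e[0])),
        )
        return child
    return rng.choice(candidates)


examples = [((0.0, 0.0), 1.0), ((5.0, 5.0), 2.0), ((0.1, -0.1), 3.0)]
rng = random.Random(0)
print(sample_child((0.0, 0.0), examples, 0.2, rng))
```

Generation would apply this step level by level, from the top of the pyramid down, before inverting the wavelet transform.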
0 hierarchical learning control approach neuronlike associative memories institut abstract advances brain theory complementary approaches analytical investigations measurements modelling supported computer simulations generate hypothesis structures neural tissue paper research line starting inspired model stimulus response andor associative psychological motivated basic control tasks conditions studied cooperation units hierarchical organisation assumed general layout brain introduction theoretic modelling brain theory highly subject clear picture complicated device measurements sound andor damaged brain parts general physics realize levels modelling physics mary level assemblies general behavioural models kinematics mechanics brain theory chemical reactions electrical spikes neuronal cell assembly cooperation general human behaviour research discussed paper located direct study synaptic cooperation neuronal cell assemblies studied amari takes account synaptic weighting simulating physical details makes general learning situation stimuli response connections building trainable basic control loops dynamic elements complex behavioural loops general work make steps studying struc tures conditions building hierarchies generate hypothesis reasons american institute physics meaning substructures brain columnar cerebral cortex compare paper organized chapter short tion basic elements building hierarchies learning control loop role memory system inspired neuronal network considerations chapter starts remarks structures brain discusses cooperation elements hierarchies substructures chapter specifies steps paper direction chapter chapter presents results achieved puter simulations finally investigations formal neuron introduced mcculloch pitts kinds neural network models proposed perceptron rosenblatt neuron equation cerebellar model articulation controller cmac albus associative memory models fukushima kohonen amari ability systems store information efficiently perform 
pattern recognition adequate substructures brain organization call microstructure acting means goal driven coordination sensory formation motor actions human brain complex solution evolution authors hierarchical combination basic elements perform elementary human brain total high similarity basic neuronal tissue human rela tively simple design learning control loop authors basis psychological findings transformed state ment complete intelligent action elements question search actions hypothesis solutions control selects solution chosen structure shown identifying question performance criterion assessment actions hypothesis predictive model environment answers control control selects situations action situations action active learning detail understood predictive model built step step procedure characterization actual situation time instant sampling time measured response unknown time instant actual situation consists measurements stimuli responses environment time instant unique char responses time provided short term memory reduce learning effort associative memory system store predictive model ability local generalization means making trained response actual situation similar situations assessment module generates basis goal wanted environment response adequate performance criterion evaluation actions testing model built answers result stored control quality real optimal action actual ation optimal action testing reached border area predictive model environ ment case real action changed sense action area predictive model extended reaches case real optimal actions guess good action optimization phase assessment module control strategy avoid unnecessary complication finally planning level superfluous quick optimal reactions checking planning level helpful find environment changed possi associative memory system control locally generalizing reduce training effort storage elements predictive model opti actions refinement implementation online application neuronal 
network model cmac albus locally generalizing neural network model storage element based pure mathematical considerations shown important property build excellent capability handle tasks environment sensory information property basic structure nervous system proven application control number technical processes starting empty memories predictive model control strategy storage details mathematical equations describing found mentioned concept explicit predictive environmental model description human handling part basic learning element suffices prediction action reach actual goal case information basic element general penalty performance degradation hierarchies number reasons brain built hierarchy control loops higher levels functions simple shows necessity cases legs jack move move separately connection build separate controller controller hierarchically higher level possibility coordinated movements find evolution historical development animals complex sense multilevel hierarchy exists motor system albus specifies levels hierarchy motor control hierarchical existing level levels general abstractions thinking supports idea assumes hierarchies fundamental element brain structuring details numbers substructures groupings substructures brain connection finds cortical layers detailed columns cell assemblies heavily connected axis vertical cortical layers sparsely connected horizontally defines comprise neural tissue roughly neural tissue roughly individual cells addi tion consist hundreds located called abstraction structures interpreted considered type number units shown ring structure filled structure building signals elements overlapping elements higher cortical layer layer projecting layer andor coordinate cooperation hierarchical sense complex system difficult simulate direction step step procedure step overlapping crosstalk suppressed number representing mini columns reduced heavily motivates research cooperation ments topics addressed lowest level coordination 
layer means coordination implemented half reasons number fundamental questions posed discussed formulation difficult meaningful coordination goals higher order system problem discussed understood coordination chapter coordination higher level subtasks detailed andor systems left half important questions hierarchies learning control loops meaningful lower level systems parallel learning levels requires meaningful learning strategy control subtasks learned coordination learned expects lower level takes care short term requirements upper level long term strategies upper level works time horizon lower levels expects upper level goals lower level lower level suppress disturbances effects upper level minimize energy consumption strategies work oscillations system question discussed general arguments tions answers simulation results chapter shows intervention schemes case intervention structure parameters controllers meant associative mappings parameters directly responsible behaviour controller case linear nonlinear differential equation conventional controller make sense controller built possibility change parameters elements means structural terms performance criterion responsible shaping controller require learn takes long time span general case distribution work load control commands meant idea control inputs hold long range required local controllers account fast dynamic fluctuations disadvantage control actions upper level included inputs local controllers extending dimension input space storage devices process appears highly time variant local controllers difficult handle case solution case commands points local controllers generating local lower level controllers requires input space extension local controllers full agreement working conditions single loops meaningful effective approach shows built structure detail control strategy divided parts storage element controller active learning elements explicitly characterized upper level lower level considered single 
pseudo process controlled simulation results questions simple nonlinear process shown coupling comparison bottom parallel learning suitably fixed bottom learning simulating optimally trained local controllers shows result time required achieving good point assistance repetition good performance reached point change parallel learning empty beginning practically performance achieved bottom training indicating simple problems considered parallel learning real illustrated sampling time sufficiently long local control reach defined qualitatively time span question respect higher difference time horizon local controller picture doubling sampling rate implemented give results interpreted smaller sampling rates information global goal reached faster larger sampling rates lead performance goal reached higher amount averaging levels goal performance criterion minimization differences actual plant output requested plant output influence goals question investigated simulating stage water process detailed description process simulation results space reasons found hierarchical systems satisfactory behaviour reached defined goals learning goal driven accept implicit wishes closed loop behaviour fulfilled chance important requirements included performance criteria explicitly finally mind simulation results single process behaviour excluding cases behaviour mentioned chapter work steps investigations hierarchical organization brain behav subjects research selforganizing task distribution processing units layer formation projections order build composed sequence frequently occuring elementary tasks investigations hand show extent kind functions achieved structures model lowlevel basic learning behaviour acknowledgements work presented supported partly detailed evaluations chapter performed assistance references albus albus albus amari amari theoretical experimental aspects cerebellar model thesis univ maryland approach manipulator control cerebellar model articulation controller 
cmac trans series model brain robot control part comparison brain model neural theory association concept formation biol cybernetics mathematical theory selforganization neural nets organization neural networks structures models outline theory thought process thinking machines journal theoretical biology verlag huber application associative neural network models technical control problems localization orientation biology engineering springer verlag berlin control selforganizing concept associative memories denmark software implementation neuronlike associative memory system control application proceedings conference mini appli cations switzerland realtime implementation associative memorybased learning control scheme linear multivariable processes symposium applications multivariable system techniques concept learning control inspired brain theory proceed world congress learning control structures neuronlike associative memory systems organization neural networks structures models fukushima model associative memory brain biol cybernetics kohonen associative memory springer verlag berlin mcculloch pitts logical calculus ideas nervous activity bull math biophys organizing principle cerebral function unit module distributed system brain edelman cambridge verlag rosenblatt perceptron recognizing automation laboratory report figures long term element nucleus nucleus nucleus hierarchy motor control exists extra pyramidal motor system basic remain brain stem coordination standing sequential coordination required walking requires area simple tasks executed region intact lengthy tasks complex goals require cerebral cortex albus generic drawn ring structures cortical layers representing simplified research model cooperation columnar structures process hierarchical work control distribution methods intervention implementation hierarchical structure nonlinear hierarchical structure nonlinear multivariable reference reference learning level trained untrained lower levels 
learning behaviour
4 perturbing hebbian rules peter dayan salk institute diego geoffrey goodhill university abstract recently linsker mackay miller analysed hebbian correlational rules synaptic development visual system miller studied rules case populations eyes analysis assumed populations correlational structure relaxing constraint effects small perturbative correlations eyes permits study stability solutions predict circumstances qualitative including production driven units introduction linsker studied hebbian correlational rule predict development receptive field structures visual system mackay miller pointed form learning rule meant analysed terms eigenvectors matrix presynaptic correlations miller independently studied similar correlational rule case eyes generally populations explaining cells develop ultimately responsive starting responsive process driven eigenvectors eigenvalues developmental equation miller relates linskers model population case analysis assumes correlations activity tion identical special case simplifies analysis enabling projections eyes separated difference variables general dayan goodhill expect correlations differ slightly correlations eyes analyse perturbations affect tors eigenvalues developmental equation explain results found empirically miller details analysis relationship hebbian models development ocular dominance orientation selectivity found goodhill equation mackay miller study linskers developmental equation form weights units layer unit layer covariance matrix activities units layer matrix vector equivalent populations cells covariance cells population cells assumed symmetric ance cells populations define full population development matrix miller studies case generally slightly negative development miller calls separate forms andor weight saturation patterns dominance populations determined initial fastest growing components upper lower weight saturation limits reached roughly time personal communication conventional assumption fastest 
growing eigenvectors dominate terminal state starting condition miller small weights constrained positive saturate upper limit additive applied development affects growth modes discussed mackay miller approximately component mackay miller analyse eigendecomposition general radially symmetric covariance matrices values turns eigendecomposition case studied miller table form perturbing hebbian rules conditions figure shows matrix eigenvectors details decomposition table slightly eigendecomposition write consequence rows table eigenvector important development separates forms eigenvectors terms onset dominance populations eigenvectors dominance requires eigenvector elements sign exists larger eigenvectors pages shows cases happen understand treat perturbed version perturbations case small correlations projections andor small differences correlations projection instance examples small prevent onset dominance analysed setting call resulting matrix questions relevant firstly eigenvectors stable perturbation vectors eigenvector eigenvector eigenvalue eigenvalues change calculate equation perturbed eigenvector satisfy conditions values terms specific notation table eigenvector eigenvalue subtracting implies standard method linear systems quantum mechanics dayan goodhill symmetric eigenvector eigenvalue multiplying left require sets stable required manner similarly stable equivalent perturbation eigenvalue pair stable eigenvalue broken specific eigenvectors stable values means order longer separate full matrix solved model results call special case assume normalised eigenvalues eigenvalue case case miller treats original solution preserved perturbed versions eigenvalues modes separate perturbed eigendecomposition suffices show small additional correla tions affect solutions give examples case mentioned page shows small radius arbor function eigenvector components sign change growing faster components positive negative ensure growing slower converting monocular solution 
binocular terms case negative matrix conditions signs components negative eigenvalue perturbed expected decrease perturbed found binocular affected amounts typically issue ultimately figure shows sample perturbed matrix dominance develop change correlations large eigenfunctions change shape notation address perturbing hebbian rules figure correlation matrix eigenvalues figure matrix eigenvectors eigen values order dayan goodhill positive correlations effect time greater eigenvalue perturbed expected decrease perturbed figure shows dominance case general perturbations mere signs components eigenvectors predict affected figure ocular dominance occur note eigenvector longer stable replaced form general perturbations order magnitude difference applied terms analysis order apply iteration matrix difference projections iterations component sets components collecting terms expression equation derive part expression depends substantial term system bias competition eigenvectors binocular solutions precise effects sensitive eigenvalues conclusions perturbation analysis applied simple hebbian correlational learning rules reveals introducing small tendency agrees results miller introducing small positive correlations eyes occur experience natural environment effect stable small perturbations make correlational structure eyes unequal produces interesting effects growth rates eigenvectors concerned initial conditions approximately equivalent projections eyes acknowledgements grateful miller helpful discussions christopher pointing direction perturbation analysis support perturbing hebbian rules figure positive correlation matrix eigenvectors eigenvalues ocular dominance inhibited figure effect random perturbations matrix order restored eigenvalues note eigenvector dayan goodhill foundation science travel grant grateful david willshaw centre cognitive science current address centre cognitive science university edinburgh place edinburgh correspondence directed references goodhill 
correlations competition optimality modelling topography ocular dominance thesis university linsker basic network principles neural architecture series proc acad mackay miller analysis linskers simulations rules neural computation mackay miller analysis linskers application hebbian rules linear networks network miller correlationbased mechanisms visual cortex theoretical empirical studies thesis stanford university medical school miller correlationbased mechanisms neural development gluck rumelhart editors neuroscience connectionist theory lawrence erlbaum miller derivation linear hebbian equations nonlinear hebbian model synaptic plasticity neural computation miller keller stryker ocular dominance column devel analysis simulation science
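
The perturbation argument above can be checked numerically. The following is a minimal sketch, not the authors' code: it assumes a one-dimensional row of units, a Gaussian same-eye covariance, and a small positive between-eye perturbation. The names `gaussian_cov`, `two_eye_matrix` and `first_order_shifts`, and all numerical values, are hypothetical. It compares the first-order estimate of the leading eigenvalue (unperturbed eigenvalue plus the eigenvector's quadratic form in the perturbation) with the exact value.

```python
import numpy as np


def gaussian_cov(n, width):
    """Gaussian covariance between n units on a line (illustrative choice)."""
    idx = np.arange(n)
    d = idx[:, None] - idx[None, :]
    return np.exp(-(d ** 2) / (2.0 * width ** 2))


def two_eye_matrix(c_same, c_opp):
    """Full two-population matrix: same-eye blocks on the diagonal,
    opposite-eye blocks off the diagonal."""
    return np.block([[c_same, c_opp], [c_opp, c_same]])


def first_order_shifts(q, dq):
    """Eigenpairs of the symmetric matrix q and the first-order eigenvalue
    shifts e_i^T dq e_i (valid for non-degenerate eigenvalues)."""
    lam, vecs = np.linalg.eigh(q)
    shifts = np.sum(vecs * (dq @ vecs), axis=0)
    return lam, vecs, shifts


n = 8                                   # units per eye (assumed)
c_same = gaussian_cov(n, 2.0)
q = two_eye_matrix(c_same, 0.2 * c_same)
# Small extra positive correlation between the eyes (assumed form and size).
dq = two_eye_matrix(np.zeros((n, n)), 1e-3 * gaussian_cov(n, 1.0))

lam, vecs, shifts = first_order_shifts(q, dq)
predicted_top = lam[-1] + shifts[-1]
exact_top = np.linalg.eigvalsh(q + dq)[-1]
print(predicted_top, exact_top)
```

In this toy setting the leading eigenvector has equal components for the two eyes (a binocular mode), and a positive between-eye perturbation raises its eigenvalue. The first-order formula is only trustworthy when the perturbation is small compared with the gap to neighbouring eigenvalues.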
4 constrained optimization applied parameter setting problem analog circuits david kirk alan computer graphics california institute technology pasadena abstract constrained optimization select operating parameters circuits simple square root circuit analog vlsi artificial cochlea automated method computer controlled test equipment choose chip parameters minimize difference actual circuits behavior goal behavior choosing proper circuit parameters important deviations adjust circuit performance range analog vlsi circuits increasingly complex implying parameters setting parameters hand cumbersome automated parameter setting method great automated parameter setting integral part engineering design methodology circuits constructed parameters enabling wide range behaviors tuned desired behaviors automatically introduction constrained optimization methods setting parameters analog circuits present experiments automated method successfully finds parameter settings circuits behavior closely approximate desired behavior experiments tion difficult subproblems encountered building electronic setup dept electrical engineering kirk watts acquire data control circuit computation desired behavior mathematical form suitable optimization tools describe components electronic setup section discuss selection optimization technique section automated parameter setting important component system build accurate analog circuits power method enhanced including appro parameters initial design circuit build circuits wide range behaviors tune desired behavior section describe comprehensive design methodology embodies strategy implementation system test ideas system conceptually decomposed distinct parts circuit analog vlsi chip intended compute function target function computational model quantitatively describing desired havior circuit model parameters circuit expressed terms biological data circuit mimic error metric compares target actual circuit function computes difference measure 
constrained optimization tool numerical analysis tool chosen based characteristics problem posed circuit parameters constrained optimization tool target function difference measure constrained optimization tool error metric compute difference performance circuit target function adjusts parameters minimize error metric causing actual circuit behavior approach target function closely generic physical setup optimization typical physical setup choosing chip parameters computer control elements analog vlsi circuit digital computer control optimization process computer programmable sources drive chip computer programmable measurement devices measure chips response combination elements environment testing chips setting parameters performed level constrained optimization applied parameter setting problem analog circuits automation desirable inputs chip measurements outputs controlled computer experiments perform experiments parameters analog vlsi circuits optimization experiment simple system square root circuit experiment complex timevarying system analog vlsi electronic cochlea cochlea composed cascaded order section filters square root experiment experiment examine circuit mead computes typically introduce parameter circuit varies indirectly adjusting voltage square root circuit shown figure alter shape response curve chip data figure square root circuit resulting control values circuit choose error metric optimizes targeting curve slope space safely ignore purposes experiment entire optimization process takes minutes simple system figure shows final results square root computation circuit output normalized analog vlsi cochlea complex system test constrained optimiza tion technique chose silicon cochlea lyon silicon cochlea cascade lowpass secondorder filter sections arranged natural frequency stages decreases exponentially distance kirk watts cascade quality factor filters section determines peak gain figure cochlea circuit performance cochlea natural taps peak gain 
performance parameters controlled bias voltages problem circuit find bias voltages give desired performance optimization task lengthy square root optimization measurement frequency response takes minutes composed individual cochlea results results attempts parameters analog vlsi cochlea encouraging figure error metric trajectories gradient descent cochlea figure shows trajectories error metrics cochlea progress made early steps optimization proceeding valley error surface shown figure figure target frequency response gradient descent optimized data cochlea figure shows target frequency response data frequency responses result chosen parameter settings curves similar differences scale measurement noise resolution system cochlea optimization strategies explored optimization strategies finding parameters electronic cochlea interest special knowledge priori knowledge effect guide optimization gradient descent assume inputoutput relation chip estimate gradient gradient descent varying inputs robust numerical techniques conjugate gradient helpful energy landscape steep found gradient descent technique reliable converge quickly special knowledge optimization corresponds intuition special knowledge circuits operation setting parameters choosing optimization method element system worked difficulty optimization complex circuits require sophisticated optimization methods wide variety constrained optimization algorithms exist figure error surface error metric frequency response cochlea note narrow valley error surface target minimum lies left part valley effective classes problems gradient descent simulated annealing platt press choose method problem hand techniques simulated find optimal parameter combinations systems complex behavior confidence methods work complex circuits choice error metric complex circuits systems timevarying signals error metric
captures time signal deal hysteresis beginning state path optimization step noisy nonsmooth functions improved averaging data robust techniques sensitive noise conclusions constrained optimization technique works welldefined goal chip operation compare automated parameter setting adjustment hand humans fail situations optimization fails multiple local minima contrast larger dimensional spaces hand adjustment difficult optimization technique succeed expect integrate technique chip development process future developments move optimization learning process gradually chip interesting note gradient descent method learns parameters chip manner similar backpropagation constrained optimization applied parameter setting problem analog circuits perspective work step path robust onchip learning order technique moderately difficult problems interface equipment parameters record results circuit computer control voltage rent sources digital cost similar setup circuits difficult issue target function circuit compute error metric simple circuit concerned behavior region entire range operation care ensure combination target model error metric accurately describes desired behavior circuit existence automated parameter setting mechanism opens avenue producing accurate analog circuits goal accurately computing function differs approach providing cheap simple circuit loosely function gilbert mead providing parameters design circuit ensure desired function domain circuit behaviors expected define domain circuit parameter setting apparatus optimization methods find solution domain potentially accurate high degree precision engineering design technique results optimization experiments suggest comprehensive engineering design technique directly affects design test chips results change types circuits build optimization techniques design build circuits frequently work expected meeting design goals corollary attack larger interesting problems technique composed steps identify target function 
behavioral goals design circuit design design circuit adjustable parameters attempting make desired target circuit behavior actual circuit expected variation device tics optimization plan devise optimization strategy explore parameter includes capabilities digital computer control opti mization instruments apply chip measure outputs optimization optimization procedure select parameters minimize actual circuit performance target function optimization make special knowledge circuit effect interaction good region explore kirk watts design circuit design optimization plan circuit optimization process produces design goals influence circuit design form optimization plan important produce match design circuit plan optimizing parameters acknowledgement carver mead ideas encouragement support project john physical setup equipment work supported part bell laboratories fellowship additional support provided findings conclusions expressed document author necessarily reflect views references platt approach solving parameter setting problem intl conf system ences january gilbert gilbert precise multiplier response ieee journal murray wright practical optimization academic press lyon lyon mead analog electronic cochlea ieee trans speech signal proc volume number july mead mead analog vlsi neural systems addisonwesley platt platt constrained optimization neural networks puter graphics thesis california institute technology june press press teukolsky numerical recipes scientific computing cambridge university press cambridge
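
The parameter-setting loop described above (apply inputs, measure outputs, compute an error metric against the target function, estimate the gradient by varying the parameters, and step downhill within allowed bounds) can be sketched as follows. This is an illustration under stated assumptions, not the authors' setup: `simulated_chip` is a hypothetical stand-in for the measured circuit, the two parameters (a gain and an exponent) are invented for the example, and simple clipping stands in for the constraint handling.

```python
import numpy as np


def simulated_chip(params, inputs):
    """Hypothetical stand-in for a measured circuit: gain * input ** exponent."""
    gain, exponent = params
    return gain * inputs ** exponent


def error_metric(params, inputs, target):
    """Mean squared difference between chip output and target function."""
    return float(np.mean((simulated_chip(params, inputs) - target) ** 2))


def estimate_gradient(params, inputs, target, step=1e-4):
    """Central-difference gradient estimate; only 'measurements' are used,
    no analytic model of the chip."""
    grad = np.zeros_like(params)
    for i in range(len(params)):
        bump = np.zeros_like(params)
        bump[i] = step
        up = error_metric(params + bump, inputs, target)
        down = error_metric(params - bump, inputs, target)
        grad[i] = (up - down) / (2.0 * step)
    return grad


def tune(params, inputs, target, lower, upper, rate=0.1, steps=3000):
    """Gradient descent on the error metric, with parameters clipped to
    their allowed range after every step."""
    params = np.array(params, dtype=float)
    for _ in range(steps):
        params = params - rate * estimate_gradient(params, inputs, target)
        params = np.clip(params, lower, upper)
    return params


inputs = np.linspace(0.1, 1.0, 20)
target = np.sqrt(inputs)                # target function: square root
tuned = tune([0.5, 1.0], inputs, target,
             lower=np.array([0.1, 0.1]), upper=np.array([2.0, 2.0]))
print(tuned, error_metric(tuned, inputs, target))
```

On real hardware each call to `error_metric` would be a measurement, so noise averaging and the number of measurements per gradient estimate become the main costs. The step size and iteration count here are tuned to this toy problem only.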
11 field methods classification gaussian processes manfred opper neural computing research group division electronic engineering computer science aston university birmingham winther theoretical physics lund university lund connect bohr institute university copenhagen copenhagen denmark abstract discuss application field methods statistical mechanics disordered systems bayesian classifi cation models gaussian processes contrast previous proaches knowledge distribution inputs needed simulation results sonar data modeling gaussian processes bayesian models based gaussian prior distributions function spaces promising nonparametric statistical tools recently introduced neural computation community neal williams rasmussen mackay give basic definition assume likelihood output target variable input written form priori assumed gaussian random field assume fields prior statistics defined order correlations denotes expectations opper respect prior interesting examples choice motivated limit neural network infinitely hidden units inputhidden weight priors williams hyperparameters determining relevant prior simplest choice corresponds single layer perceptron independent gaussian weight priors bayesian framework make predictions input received training examples posterior distribution field test point conditional gaussian distribution posterior distribution field variables training points normalizing partition function prior distribution fields training points introduced major technical problem approach difficulty forming high dimensional integrations nongaussian likelihoods treated approximations monte carlo sampling neal laplace integration barber williams bounds likelihood gibbs mackay paper introduce approach based field method statistical physics disordered systems parisi specialize case binary classification problem binary class label predicted training corrupted label noise likelihood problem probability true classification label corrupted step function defined case expect 
model laplaces method bounds introduced gibbs mackay directly applicable field methods classification gaussian processes exact posterior averages order make prediction input ideally label maximum probability chosen predic tive probability binary case bayes classifier paper brackets denote posterior averages simpler approach prediction reduce ideal prediction posterior distribution symmetric goal field approach provide equations approximately determining starting point analysis partition function auxiliary variables integrated imaginary axis introduced order hard show posterior averages fields training inputs test point reduced problem calculation microscopic rameters averages statistical physics calculated derivatives respect small external fields equivalent formulation legendre transform function expectations case additional averages introduced dynamical vari ables unlike ising spins fixed length external fields eliminated true expectation values satisfy naive field theory description give calculated nongaussian likelihood models interest based field theory guess approximate form integrations imaginary axis expectations positive fact integration measure complex opper winther field methods found interesting applications neural computing framework ensemble learning exact posterior distribution approximated simpler product distributions variational treat ment standard field method posterior case gaussian process classification preparation discussed paper suggest route introduces nontrivial corrections simple naive variables variational method purely formal distribution complex define probability ways define simple perturbation expansion respect interactions order approaches yield result contribution model interactions legendre transform error function simple models statistical physics interactions care positive equal easy show exact limit infinite number variables systems large number nonzero interactions orders magnitude expect approximation approach interactions 
positive negative expect inputs thermodynamic limit nice distributions inputs additional contribution added naive field theory correction called reaction term introduced spin glass model anderson palmer applied statistical mechanics single layer perceptrons generalized bayesian framework opper winther application multilayer networks wong thermodynamic limit infinitely large dimension input space nice input distributions results shown coincide results replica framework drawback previous derivations neural networks fact special assumptions input distribution made fluctuating terms replaced averages distribution random data practice paper approach parisi circumvent problem concluded applied case spin model random interactions specific type functional form depend type single particle contribution model calculated gaussian regression model subtract naive field contribution obtain classification gaussian processes desired sake simplicity chosen simpler model changing final result lengthy straightforward calculation problem leads result eliminated leads equation note choice field theory exact gaussian likelihoods standard regression problems finally setting derivatives respect variables equal obtain equations gaussian measure solved numerically contrast naive simpler result found simulations solving nonlinear system equations iteration turns straightforward data sets convergence diagonal term covariance matrix shown term learning gaussian noise variance added gaussian random field present simulation results single data sonar versus rocks split original study sejnowski input data preprocessed linear rescaling training input variable unit variance cases field equations failed converge data important feature fact method approximate leaveoneout estimator generalization error expressed terms solution field equations opper winther details derive leaveoneout estimator naive opper winther published dealt problem automatically estimating hyperparameters number drastically reduced 
setting covariances remaining hyperparameters chosen opper table result sonar data exact algorithm covariance function field naive field backprop simple perceptron hidden minimize turned lowest found modeling noise simulation results shown table comparisons backpropagation sejnowski solution found algorithm turned unique order presentation examples ferent initial values solution table compared estimate algorithm exact leaveoneout estimate exact obtained training keeping testing running field algorithm rest estimate exact complete agreement comparing test error training hard test easy small difference test error naive full field algorithms field scheme robust respect choice discussion work make approach practical tool bayesian modeling find methods solving equations conversion direct minimization problem free energy helpful achieve work real field variables imaginary problem determination hyperparameters covariance functions ways interesting approximate free energy essentially negative logarithm bayesian evidence estimate probable values hyperparameters estimate errors made approach builtin leaveoneout estimate estimate generalization error estimate validity approximation inter apply deriving equations models boltzmann machines belief nets combinatorial optimization standard field theories applied successfully acknowledgments research supported foundation research research natural technical sciences computational neural network center connect methods classification gaussian processes references barber williams gaussian processes bayesian classification hybrid monte carlo neural information processing systems mozer jordan petsche press gibbs mackay variational gaussian process classifiers preprint cambridge university sejnowski analysis hidden units layered network trained classify sonar targets neural networks mackay gaussian processes replacement neural networks nips obtained space interactions neural networks computation cavity method phys parisi spin glass theory 
lecture notes physics world scientific neal bayesian learning neural networks lecture notes statistics springer neal monte carlo implementation gaussian process models bayesian regression classification technical report dept computer science university toronto opper winther field approach bayes learning feed forward neural networks phys lett opper winther field algorithm bayes learning large feedforward neural networks neural information processing systems mozer jordan petsche press parisi meanfield equations spin models orthogonal interaction matrices phys math convergence condition equation ising spin glass phys anderson palmer solution solvable model spin glass phil williams computing infinite networks neural information cessing systems mozer jordan petsche press williams rasmussen gaussian processes regression neural information processing systems touretzky mozer hasselmo press wong microscopic equations stability conditions optimal neural networks europhys lett
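
The binary classification setting above can be made concrete with a short sketch. It assumes the label-noise likelihood has the form kappa + (1 - 2 kappa) * step(y h), where kappa is the probability that the true label is corrupted, and it assumes that samples of the posterior field at a test point are available from some other procedure. The function names and sample values are hypothetical. It contrasts the Bayes prediction (sign of the posterior average of the sign of the field) with the simpler prediction (sign of the posterior average of the field).

```python
import numpy as np


def flip_noise_likelihood(y, h, kappa):
    """Assumed form: p(y | h) = kappa + (1 - 2 * kappa) * step(y * h)."""
    agrees = (np.asarray(y) * np.asarray(h)) > 0
    return kappa + (1.0 - 2.0 * kappa) * agrees.astype(float)


def bayes_label(h_samples):
    """Bayes classifier from posterior field samples: sign of <sign(h)>."""
    return int(np.sign(np.mean(np.sign(h_samples))))


def simple_label(h_samples):
    """Simpler predictor: sign of the posterior mean field <h>."""
    return int(np.sign(np.mean(h_samples)))


# A skewed toy 'posterior' on which the two predictors disagree.
skewed = np.array([-1.0, -1.0, 5.0])
print(bayes_label(skewed), simple_label(skewed))

# A symmetric toy 'posterior' on which they agree.
symmetric = np.array([0.5, 1.0, 1.5])
print(bayes_label(symmetric), simple_label(symmetric))
```

The two predictors coincide when the posterior of the field is symmetric about its mean, which is the case in which the simpler rule loses nothing.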
12 neural computation winnertakeall nonlinear operation maass institute theoretical computer science technische email httpwww abstract neural networks single layer nonlinear units compute interesting functions show false employs winnertakeall nonlinear unit boolean function computed single unit applied weighted sums input variables continuous function approximated arbitrarily single soft winnertakeall unit applied weighted sums input variables positive weights needed linear weighted sums interest point view neurophysiology synapses cortex inhibitory addition widely believed special cortex compute winnertakeall results support view winnertakeall basic computational unit neural vlsi wellknown winnertakeall input variables computed efficiently transistors wire length area linear analog vlsi lazzaro show winnertakeall special purpose computations serve nonlinear unit neural circuits universal computational power show multilayer perceptron quadratically gates compute winnertakeall input variables winnertakeall substantially powerful computational unit perceptron cost implementation analog vlsi complete proofs details results found maass maass introduction computational models involve competitive stages neglected computational complexity theory widely computational brain models artificial neural networks analog vlsi circuit lazzaro computes approximate version winnertakeall inputs transistors wires length lateral inhibition implemented adding currents single wire length numerous efficient implementations winnertakeall analog vlsi subsequently produced circuits based silicon spiking neurons circuits emulate attention artificial sensory processing horiuchi preceding analytical results winnertakeall circuits found grossberg brown analyze section computational power basic competitive computational operation winnertakeall section discuss complex operation implemented analog vlsi section devoted soft winnertakeall implemented analog vlsi temporal coding output results shows
winnertakeall surprisingly powerful computational module comparison threshold gates neurons sigmoidal gates theoretical analysis answers basic questions raised view wellknown asymmetry excitatory inhibitory connections cortical circuits computational power neural networks lost positive weights employed weighted linear sums learning capability lost positive weights subject plasticity neural circuits digital output investigate section computational power gate comput function largest inputs precisely holds indices neural computation theorem twolayer feedforward circuit analog binary input variables binary output variable consisting gates simulated circuit consisting single gate applied weighted sums input variables positive weights holds digital inputs analog inputs inputs measure boolean function computed single gate applied positive weighted sums input bits remarks polynomial size integer weights size bounded number linear gates bounded polynomial weights simulating circuit natural numbers size bounded polynomial exception measure result union hyper planes easily show exception measure theorem circuit structure converted back thresh circuit number gates quadratic number weighted sums gates relies construction section proof theorem outputs gates hidden layer assume loss generality weights gate details observes suffices integer weights threshold gates binary inputs normalize weights values gates hidden layer circuit input threshold gates hidden layer threshold output gate order eliminate negative weights replace gate threshold gate hyperplane exception consisting hyperplanes weights threshold gate output exploit arbitrary maass arbitrary threshold gates threshold gate weights back linear gates positive weights sums absolute weights gates implies output gate neural computation winnertakeall applied satisfies note coefficients sums positive neural circuits analog output order approximate arbitrary continuous functions values circuits similar structure preceding section 
variation winnertakeall gate outputs analog numbers values depend rank input linear order input numbers argue gate longer winnertakeall gate agreement common terminology refer soft winnertakeall gate gate computes function soft winnertakeall roughly proportional rank numbers precisely parameter gate focuses inputs rank input numbers belongs ranks linearly scaled theorem circuits consisting single soft winnertakeall gate output applied positive weighted sums input variables universal approximators arbitrary continuous functions shown maass continuous monotone scaling maass circuit type considered theorem soft winnertakeall gate applied positive weighted sums simple geometrical interpretation point input plane relative heights hyperplanes defined positive weighted sums circuit output hyperplanes point lower bound result winnertakeall easily kwta gate inputs computed thresh circuit consisting threshold gates threshold gates threshold gates result optimal lower bound theorem feedforward threshold circuit perceptron computes inputs gates conclusions lower bound result theorem shows computational power large compared powerful gate commonly studied circuit complexity theory threshold gate referred neuron perceptron neural computation winnertakeall minsky papert single threshold gate compute important functions circuits moderate polynomial size consisting layers threshold gates polynomial size integer weights computational power shown theorem hidden layer circuit simulated single gate applied polynomially weighted sums positive integer weights poly size analyzed computational power soft winnertakeall gates context analog computation shown theorem single soft winnertakeall gate serve nonlinearity class circuits universal computational power sense approximate continuous functions universal approximators require positive linear operations sides soft winnertakeall showing principle computational power lost biological neural system inhibition exclusively lateral tion adaptive 
flexibility lost synaptic plasticity learning restricted excitatory synapses surprising results computational power winnertakeall point lowpower analog vlsi chips winnertakeall implemented efficiently technology references brown brown neural switching network control caltech grossberg grossberg contour enhancement short term memory neural networks studies applied mathematics horiuchi horiuchi morris koch deweerth analog vlsi circuits visual tracking advances neural information processing systems modeling selective attention neuromorphic analog vlsi device submitted publication lazzaro lazzaro ryckebusch mahowald mead winnertakeall networks complexity advances neural information processing systems morgan kaufmann mateo maass maass computational power winnertakeall neural computation press pulse coded winnertakeall networks silicon implementation pulse coded neural networks kluwer academic publishers boston minsky papert minsky papert perceptrons press cambridge roychowdhury kailath discrete neural computation theoretical foundation prentice hall englewood cliffs circuit complexity ieee trans neural networks
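
The winner-take-all gate and one step of the construction discussed above can be sketched in a few lines. This is an illustration only, not the proof of the theorems: it shows the k-winner-take-all gate as a function, and how a single threshold gate with negative weights can be replaced by a comparison of two weighted sums that use only positive weights (negative terms are moved to the other side of the inequality). The function names, the tie-breaking rule, and the example weights are assumptions.

```python
import itertools

import numpy as np


def k_winner_take_all(x, k):
    """Output bit i is 1 when fewer than k inputs are strictly larger than
    x[i], i.e. x[i] is among the k largest (ties resolved in favour of 1)."""
    x = np.asarray(x, dtype=float)
    larger = (x[None, :] > x[:, None]).sum(axis=1)
    return (larger < k).astype(int)


def winner_take_all(x):
    """Plain winner-take-all is the k = 1 case."""
    return k_winner_take_all(x, 1)


def threshold_gate(w, theta, x):
    """Ordinary threshold gate: 1 if w . x >= theta, else 0."""
    return int(np.dot(w, x) >= theta)


def threshold_via_wta(w, theta, x):
    """Same decision using winner-take-all over two sums that contain only
    non-negative weights (a constant input 1 carries the threshold)."""
    w = np.asarray(w, dtype=float)
    x = np.asarray(x, dtype=float)
    s_pos = np.dot(np.clip(w, 0, None), x) + max(-theta, 0.0)
    s_neg = np.dot(np.clip(-w, 0, None), x) + max(theta, 0.0)
    return int(winner_take_all([s_pos, s_neg])[0])


w, theta = [2.0, -1.0, -1.0], 0.5       # example gate (assumed values)
agree = all(
    threshold_gate(w, theta, x) == threshold_via_wta(w, theta, x)
    for x in itertools.product([0, 1], repeat=3)
)
print(agree)
```

The comparison of two positive sums is the elementary move; the results in the text concern whole two-layer threshold circuits and continuous functions, which need the fuller constructions given in the cited work.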
3 neural network implementation admission control guyon bell laboratories corner holmdel abstract feedforward layered network implements mapping required control unknown stochastic nonlinear dynamical system training based approach combines stochastic approximation ideas back propagation method applied control admission queueing operating timevarying environment introduction controller discretetime dynamical system provide time control variable information state system decision made observable determined basis current observation preceding control action information controller implements mapping controllers suffice static require control policy constant mapping closedloop controllers dynamic control action determined information work addresses question training neural network implement general mapping problem arises lack training patterns input quality control policy assessed control system monitoring system performance sensitivity performance variations control policy investigated analytically system unknown show sensitivity estimated stan dard framework stochastic approximation usual backpropagation algorithm determine sensitivity output variations parameters network adjusted improve system performance advantage neural network closedloop controller ability accept inputs additional time steps past provide infor mation history controlled system demonstrated neural network controllers capture regularities structure timevarying environments powerful tracking time variations driven stationary stochastic processes guyon solla control stochastic dynamical systems dynamical system state updated discrete times control input effect time affects dynamical evolution stochastic process models intrinsic randomness system external disturbances variable accessible direct measurement knowledge state system limited observable goal design neural network controller produces specific control variable applied time information order design controller implements control policy 
specification purpose controlling dynamical system needed typically function observable measures system performance composition function state system control variable stochastic variable quantity interest expectation system performance averaged respect average expectation estimated ergodic system goal controller generate sequence control values average performance stabilizes desired parameters neural network adapted minimize cost function neural network implementation admission control dependence implicit depends controlling sequence depends parameters neural network online training proceeds gradient descent update minimization instantaneous deviation target output controller expected provide response input output considered variable controls subsequent performance factor measures sensitivity output neural network controller internal parameters fixed input output function network parameters gradient scalar function easily computed standard backpropagation algorithm rumelhart factor measures sensitivity system performance control variable information system needed evaluate derivative unknown functions describes affected fixed function describes dependence propagates observable algorithm rendered operational stochastic approximation kushner assuming average system performance monotonically increasing function sign partial derivative positive stochastic approximation amounts neglecting unknown fluctuations derivative approximating positive step size online update rule stochastic approximation online gradient update instantaneous gradient based current measurement gradient expected guyon solla deviations respect target minimized combined backpropagation stochastic approximation evaluate leading update rule general powerful learning rule neural network controllers requirement average performance monotonic function control variable section illustrate application algorithm admission controller traffic queueing problem advantage neural network standard stochastic 
approximation approach apparent mapping produces track timevarying environment generated stationary stochastic process straightforward extension approach discussed train network implement mapping time steps past provide information history controlled system network capture regularities time variations environment queueing problem admission controller queueing system depicted system includes server queue call admission mechanism controller local arrivals server arrivals admission queue rate controller services figure admission controller problem neural network implementation admission control serve independent traffic streams single server arises networks typical situation addition remote arrivals monitored control node local arrivals admission queue monitored limited information scenario controller execute policy meets performance objectives situation model streams offered queueing system remote traffic local traffic streams poisson times independently exponentially distributed calls originated remote stream controlled admission queue local calls controlled monitored arrival rate remote calls fixed rate local calls timevarying depends state stationary markov chain service time required call type exponentially distributed random variable calls find empty queue arrival immediately service wait queue service arrival assigned threshold independently drawn fixed unknown distribution characterizes behavior waiting time queue exceeds threshold call ideally incoming call server process average calls unit time offered load approaches exceeds queue starts build long result long delays induce heavy limits reject remote arrivals call admission mechanism implemented shown figure rate control berger tokens arrive deterministic rate finite tokens find full bank lost token needed remote call find empty token bank rejected queue tokens calls remote controlled local calls local arrival rate controlled underlying markov chain process transition rate neighboring states markov chain 
state local arrival rate complete specification state system time require information number arrivals services remote local traffic preceding time interval duration controllable remote traffic waiting time call local traffic monitored information arrivals waiting times accessible information remote traffic number rejected calls number number calls information time includes preceding control action controller determine guyon solla goal control policy calls compatible rate ratio plays role performance measure target values excess imply excessive number require admission control values smaller penalized obtained expense results simulations reported correspond server capable handling calls rate unit time remote traffic arrival rate local traffic arrival rate controlled markov chain offered load spans range steps transition rates markov chain simulate slow moderate rapid variations offered load neural network controller receives inputs time input units hidden layer units information single output unit bound rate check neural network controller capable correct generalization network trained timevarying scenario subject static testing training takes place offered load varying rate network tested underlying markov chain frozen fixed long period stabilize control variable fixed obtain statistically meaningful values careful numerical investigation quantities function reveals neural network developed adequate control policy light loads spontaneously result values require control guarantees ample token supply exceeds system controlled decreasing increasing satisfy requirement detailed results static performance comparison standard stochastic approximation approach reported tracking timevarying environment power neural network controller revealed network trained varying offered load tested dynamically monitoring distribution network controls environment varying rate training distribution prob shown neural network controller outperforms stochastic approximation system 
probability keeping rate bounded larger controller values bound goal exceeding achieved probability comparison rejection distribution prob shown illustrates control policy provided results shown confirm superiority control policy stochastic approximation fixed gain enable controller track timevarying environments gain optimized numerically neural network implementation admission control developed neural network neural control stochastic approximation stochastic approximation figure distribution rejection distribution conclusions control unknown stochastic system requires mapping implemented feedforward layered neural network learning rule blend stochastic approximation backpropagation proposed overcome lack training patterns online performance information provided system control tested admission control problem approach shows promise variety applications control networks references berger control rate control selecting token capacity robustness arrival rates ieee transactions automatic control kushner stochastic approximation methods constrained unconstrained systems springer verlag queueing systems volume theory john wiley sons rumelhart hinton williams learning representations back propagating errors nature
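The learning rule summarized in the conclusions above combines two factors: the sensitivity of the controller output to its parameters, obtained by backpropagation, and the sensitivity of the system performance to the control variable, which is unknown and is replaced by a positive constant under the monotonicity assumption. The following is a minimal sketch of that idea on a toy problem, not the admission control simulation of the paper. The plant `unknown_system`, the single sigmoid unit, the target value, and the step size are all hypothetical choices made for illustration.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def unknown_system(u, rng):
    """Hypothetical plant, hidden from the controller.

    Performance is noisy and monotonically increasing in the control u.
    """
    return 2.0 * u + rng.gauss(0.0, 0.1)


def train_controller(target=1.0, eta=0.5, steps=2000, seed=0):
    """Online training of a one-unit controller u = sigmoid(w * x + b)."""
    rng = random.Random(seed)
    w, b = 0.5, 0.5
    x = 1.0  # constant observation, kept trivial for this sketch
    for _ in range(steps):
        u = sigmoid(w * x + b)
        y = unknown_system(u, rng)  # measured instantaneous performance
        # Stochastic approximation step: the unknown derivative of the
        # performance with respect to u is replaced by a positive constant.
        dy_du = 1.0
        # Backpropagation step: derivative of the controller output
        # with respect to its net input.
        du_dz = u * (1.0 - u)
        delta = (y - target) * dy_du * du_dz
        w -= eta * delta * x
        b -= eta * delta
    return w, b


w, b = train_controller()
u_final = sigmoid(w * 1.0 + b)
```

After training, the noise-free performance `2.0 * u_final` of this toy plant should sit close to the target, even though the controller never sees the plant equation, only the measured performance.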
0 computer simulation olfactory cortex functional implications storage retrieval olfactory information wilson james bower computation neural systems program division biology california institute technology pasadena abstract based anatomical physiological data developed computer simulation form olfactory cortex capable reproducing spatial temporal patterns actual cortical activity variety conditions simple hebbtype learning rule tion cortical dynamics emerge anatomical physiological tion model simulations capable establishing cortical representations differ input patterns basis representations lies interaction sparsely highly interconnections modeled neurons shown representations stored minimal interference learning representations input degradation allowing reconstruction representa tion partial presentation original training stimulus demonstrated degree overlap cortical representations stimuli modulated instance similar input patterns induced generate distinct cortical representations discrimination dissimilar inputs induced generate overlapping representations features important classifying stimuli introduction piriform cortex primary olfactory cerebral cortical structure receives order input olfactory receptors olfactory bulb believed play significant role classification storage olfactory information years computer simulations tool studying information processing cortex interested higher order functional questions modeling objective construct computer simulation contained sufficient neurobiological detail reproduce experimentally obtained cortical activity patterns step crucial establish correspondences model cortex assure model capable generating output compared data actual physiological experiments current case demonstrated behavior simulation approximates actual cortex model explore types processing cortical structure partic ular paper describe ability simulated cortex store recall cortical activity patterns generated stimulus conditions approach provide 
experimentally testable hypotheses functional organization cortex deduce solely neurophysiological data american institute physics receptors olfactory bulb piriform cortex olfactory structures hippocampus cortex simplified block diagram olfactory system closely related structures model description model largely instructed neurobiology piriform cortex axon conduction velocities time delays general properties neuronal inte major intrinsic neuronal connections approximate actual cortex simulation reduces number complexity simulated neurons additional information important features cortex obtained incorporated model numbers text refer mathematical expressions found appendix neurons model distinct populations intrinsic cortical neurons fourth cells simulate cortical input olfactory bulb intrinsic neurons consist excitatory population neurons principle neuronal type cortex tions inhibitory simulations population modeled neurons arranged array actual piriform cortex order neurons output modeled cell type action potential generated membrane potential cell crosses threshold output reaches neurons delay function velocity fiber connects cortical distance originating neuron target neuron action potential arrives destination cell triggers conduc tance change ionic channel type cell time amplitude waveform effect conductance change potential drive equilibrium potential channel channels included model channels activated activity synapses cell types afferent local fiber fiber feedback inhibition directed fiber schematic diagram cortex showing excitatory pyramidal cell inhibitory local interactions circles sites synapfic bility connection patterns olfactory system olfactory receptors project olfactory bulb turn projects directly piriform cortex tory structures input piriform cortex olfactory bulb delivered fiber lateral olfactory tract fiber tract appears make sparse excitatory connections feedforward inhibitory neurons extent cortex model input simulated independent cells make 
connections pyramidal feedforward inhibitory neurons addition input connections olfactory bulb extensive connections neurons intrinsic cortex association fiber system arises pyramidal cells makes sparse distributed excitatory connections pyramidal cells cortex model connections randomly distributed probability model actual cortex pyramidal cells make connections nearby feedforward feedback inhibitory cells turn make reciprocal inhibitory connections group nearby pyramidal cells primary effect feedback inhibitory neurons inhibit pyramidal cell firing mediated current shunting mecha feedforward interneurons inhibit pyramidal cells long latency long duration mediated potential pyramidal cell axons constitute primary output model actual piriform properties modification rules model synaptic weight determines peak amplitude change induced postsynaptic cell presynaptic activity study learning model synaptic weights fiber systems modifiable activitydependent fashion basic modification rule case change synaptic strength proportional presynaptic activity multiplied offset membrane potential baseline potential baseline potential slightly positive equilibrium potential feedback inhibition means synapses activated destination cell depolarized excited state strengthened activated period inhibition weakened model synapses follow rule include association fiber connections excitatory pyramidal neurons connections inhibitory neurons pyramidal rons synapses modifiable actual cortex subject active research model mimic actual synaptic properties input pathway shown undergo transient increase synaptic strength activation independent potential increase permanent synaptic strength subsequently returns baseline generation physiological responses neurons model represented firstorder leaky multiple timevarying inputs simulation runs membrane potentials currents time action potentials stored comparison actual data explicit compartmental model compartments pyramidal cells generate spatial 
current distributions calculation field potentials evoked potentials stimulus characteristics compare responses model actual cortex actual experimental stimulation simulated cortex resulting intracellular extracellular records stimuli applied characteristic cortical evoked potentials vivo model simulated stimulus paradigm simultaneously activating input fibers measure cortical activity successfully freeman colleagues involves recording activity piriform cortex behaving animals responses generated model steady random stimulation input fibers study learning model physiological measures established required refined stimulation procedures absence specific information actual input activity patterns constructed stimulus randomly selected input fibers stimulus episode consisted burst activity subset fibers duration msec msec intervals simulate actual olfactory bulb input pattern activity repeated trials msec duration roughly corresponds theta rhythm period activity trial presented times total exposure time cortical time period hebb type learning rule modify connection weights activity dependent fashion output measure learning sole output cortex form action potentials generated pyramidal cells output measure model vector spike frequency pyramidal neurons msec trial element vector firing frequency single pyramidal cell figures show array pyramidal cells size cell position represents magnitude spike frequency cell evaluate learning effects overlap comparisons response pairs made taking normalized product response vectors expressing percent overlap simulated actual simulated physiological responses model compared actual conical upper simulated intracellular response single cell paired stimulation input system left compared actual response middle simulated extracellular response recorded conical surface stimulation left compared actual response lower stimulated response cortical surface input left actual freeman computational requirements simulations carried model equipped 
memory floating point accelerator average time msec simulation minutes results physiological responses initial modeling objective accurately simulate wide range activity patterns recorded piriform cortex physiological procedures comparisons actual simulated records types response shown figure general model replicated physiological responses wilson preparation describes detail analysis physiological results response stimulation input pathway model reproduces principal characteristics intracellular location dependent extracellular waveforms recorded actual cortex percent overlap final response pattern number trials convergence cortical response training single stimulus synaptic modification overlap full stimulus training overlap full stimulus training reconstruction cortical response patterns partially degraded stimuli left response training full stimulus left stimulus input fibers degradation response response full stimulus left stimulus input fibers result degradation trained trained retains response storage multiple patterns left response stimulus training middle response stimulus training training response stimulus training training compared original response left response stimulation model exhibits oscillations characteristic activity olfactory cortex awake behaving animals scope present paper simulation damped oscillatory type activity cortex special stimulus conditions learning simulated characteristic physiological responses explore capabilities model store recall information learning case defined development consistent representation cortex input pattern repeated stimulation synaptic modification figure shows network converges training representation stimulus demonstrated studied properties learned responses reconstruction trained cortical response patterns partially degraded stimuli simultaneous storage separate stimulus response patterns modulation cortical response patterns independent relative stimulus characteristics reconstruction learned cortical
response partially stimuli interested knowing effect training sensitivity cortical responses fluctuations input signal presented model random stimulus trial synaptic modification trial model presented degraded version half original input fibers comparison responses stimuli naive cortex showed variation model trained full stimulus synaptic modification half input removed model presented degraded stimulus trial synaptic modification case overlap overlap stimulus stimulus training stimulus stimulus training results merging cortical response dissimilar stimuli left response stimulus stimulus stimuli activate input fibers common overlap cortical response patterns response stimulus training presence common modulatory input overlap cortical response patterns difference cortical responses showing training increased robustness response degradation stimulus storage patterns model trained random stimulus response vector case saved continuing weights obtained training model trained overlapping input fibers activated stimulus stimulus stimulus activated roughly cortical pyramidal neurons overlap responses training period assessed amount interference recalling introduced training presenting stimulus single trial synaptic modification variation response additional training initially saved demonstrating learning substantially interfere ability recall modulation cortical response patterns previously demonstrated stimulus evoked response olfactory cortex modulated factors directly tied stimulus qualities behavioral state animal interested knowing representations stored model modulated influence state input potential role state input merge cortical response patterns dissimilar stimuli effect refer test model presented random input stimulus trial presented random input stimulus nonoverlapping input fibers amount overlap cortical responses untrained cases model trained stimulus presence additional random state stimulus activity input fibers distinct overlap overlap stimulus stimulus stimulus stimulus training
training results differentiating cortical response patterns similar stimuli left response stimulus stimulus stimuli activate input fibers common overlap cortical response patterns response stimulus stimulus training presence modulatory input training modulatory input overlap cortical response terns model trained stimulus presence state stimulus training model presented stim trial stimulus trial results showed input amount overlap responses found increased role case provide common stimulus component learning reinforced shared components responses input stimuli test ability state stimulus induce differentiation cortical response patterns similar stimuli presented model random input stimulus trial trial random input stimulus input fibers overlapping amount overlap cortical responses untrained cases model trained period stimulus presence additional random state stimulus input fibers overlapping trained input presence random state stimulus input fibers overlapping training model presented stimulus trial stimulus trial amount overlap found decreased situation provided differential signal learning reinforced distinct components responses input stimuli discussion physiological responses detailed discussion mechanisms underlying simulated patterns physiological activity cortex scope current paper model suggesting roles specific features cortex generating physiologically recorded activity actual input cortex olfactory bulb modulated bursts continuous stimulation model allowed demonstrate models capability intrinsic periodic activity independent pattern stimulation olfactory bulb similar ability demonstrated models freeman studying oscillating property model associate oscillatory characteristics specific interactions local distant network properties inhibitory excitatory time constants axonal conduction velocities result suggests underlying mechanisms oscillatory patterns previously proposed learning main subject paper examination learning capabilities cortical model model 
apparently sparse highly distributed pattern connectivity characteristic piriform cortex fundamental model learns essentially highly distributed pattern connections model develop cortical response patterns extracting correlations randomly distributed input association fiber activity correlations effect stored synaptic weights association fiber local inhibitory connections model demonstrated robustness learned cortical response degradation input signal property action sparsely distributed association fibers provide previously established patterns cortical activity property arises modification synaptic weights correlations activity intracortical association fibers result modification activity subset pyramidal neurons driven degraded input drives remaining neurons response general model similar stimuli similar cortical dissimilar stimuli dissimilar cortical responses important function cortex simply store sensory information represent incoming stimuli function absolute stimulus qualities context stimulus occurs fact structures piriform cortex projects receives projections involved multimodal state generation evidence modulation occur demonstrated model input modify representations generated pairs stimuli push representations stimuli pull representations similar stimuli pointed modulatory input signal explicitly directed representation state signal require priori knowledge representational structure model modulatory phenomenon simple consequence degree overlap combined odor stimulus stimulus cases approached approximately overlap cortical responses reflecting approximately overlap combined stimuli cases interest models capabilities maintain modulated response input stimulus absence modulatory input conclusions approach studying system involves computer simulation investigate mechanisms information processing implemented biological constraints significance results presented lies primarily finding structure model parameter settings reproduction physiological responses proper
convergence simple cally plausible learning rule conditions model developed approximation actual cortex limited knowledge organization computing power actual piriform cortex order cells compared simulations sparsity connection order compared simulations continuing research effort include explorations scaling properties network assumptions made context current model include assumption representation information piriform cortex form spatial distributions outputs information contained temporal patterns activity analyzed preliminary observation suggests significance fact dynamics model suggest temporally encoded information input time scales cortex additionally output cortex assumed spatial uniformity differential weighting information made basis spatial location cortex observation dynamics model details anatomical distribution patterns major preliminary evidence model form hierarchical structuring information lines occur cells found progressively rostral locations increasingly odor responses investigations learning model explore issues fully attempts correlate simulated findings actual recordings awake behaving animals time data structure cortex incorporated model emerges acknowledgements lewis haberly joshua roles development continued support modeling effort dave technical assistance work supported grant grant corporation arcs foundation appendix somatic integration number input types membrane potential cell current cell input type equilibrium potential input type resting potential membrane leakage resistance membrane capacitance conductance input type cell spike propagation synaptic input number simulation distance adjacent cells duration conductance change input type velocity signals input type latency input type spatial attenuation factor input type minimum spatial attenuation input type refractory period threshold cell distance cell distribution synaptic density input type synaptic weight cell cell conductance input type cell conductance waveform input type
spike output cell time unit step function field potentials number simulation number segments compartmental model approximate extracellular field potential cell membrane current segment cell dendritic model depth site depth segment location cell extracellular resistance unit length number channels segment membrane potential segment membrane capacitance segment resistance segment membrane resistance segment conductance channel segment equilibrium potential channel current segment membrane current segment length segment diameter segment membrane unit length resistance unit length capacitance unit surface references freeman neurophysiol neurophysiol haberly chemical senses wilson bower haberly neuro wilson bower neurosci comp haberly price comp neurol haberly comp haberly bower neurophysiol stevens neurophysiol stevens neurophysiol mori neurophysiol haberly neurosci price comp bower proc natl acad freeman neurophysiol haberly neurophysiol haberly neurophysiol freeman clin neurophysiol freeman schneider science robinson yale biol clin neurophysiol freeman neurol
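Two quantities from the olfactory cortex model above lend themselves to a short illustration: the hebb type modification rule, in which the change in synaptic strength is proportional to presynaptic activity multiplied by the offset of the membrane potential from a baseline potential, and the percent overlap measure, taken from the normalized product of two response vectors. The sketch below is a minimal reading of those descriptions. The function names, the learning rate, the example potentials, and the interpretation of the normalized product as a cosine between spike frequency vectors are assumptions, not values or formulas taken from the appendix.

```python
import math


def hebb_update(weight, pre_activity, v_post, v_baseline, rate=0.01):
    """One step of the modification rule as described in the text.

    The weight grows when the synapse is active while the destination cell is
    depolarized above baseline, and shrinks when it is below baseline.
    """
    return weight + rate * pre_activity * (v_post - v_baseline)


def percent_overlap(a, b):
    """Percent overlap of two response vectors (spike frequency per cell).

    Assumed here to be the dot product divided by the product of the norms.
    """
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 0.0 if norm == 0 else 100.0 * dot / norm


# Hypothetical responses of four pyramidal cells to two stimuli.
response_a = [12.0, 0.0, 5.0, 0.0]
response_b = [10.0, 0.0, 0.0, 7.0]
overlap = percent_overlap(response_a, response_b)
```

Identical response vectors give an overlap of 100 percent and vectors with no active cells in common give 0 percent, which matches how overlap is used in the text to compare responses before and after training.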
9 statistically efficient estimation cortical lateral connections pouget zhang abstract coarse codes widely brain encode motor variables methods designed interpret codes population vector analysis inefficient variance estimate larger smallest possible variance biologically maximum likelihood methods attempt compute scalar vector estimate encoded variable neurons faced similar estimation problem read responses presynaptic neurons contrast typically encode variable population code scalar show nonlinear recurrent network form estimation optimal keeping estimate coarse code format work suggests lateral connections cortex involved cleaning uncorrelated noise neurons representing similar variables introduction sensory motor variables brain encoded coarse codes activity large populations neurons broad tuning variables instance direction visual motion believed encoded visual area responses large number cells bellshaped tuning illustrated figure neurophysiological recordings shown response object moving direction pattern activity population noisy hill activity figure basis activity recover conditional probability direction motion activity slightly goal good guess estimate direction activity stochastic nature noise estimator random variable institute computational cognitive sciences university washington salk institute jolla work funded mcdonnellpew howard hughes medical institute zhang direction preferred direction figure tuning curves direction tuned neurons noisy pattern activity neurons presented direction estimate found moving expected hill activity dotted line squared distance data minimized solid line image vary trial trial good estimator smallest variance trials variance determines similar directions estimator bound analytical lower bound variance noise system unit tuning curves typically computationally simple estimators optimum linear estimator inefficient variances times bound contrast bayesian maximum likelihood equivalent case consideration paper reach bound
require complex calculations decoding valuable interested reading population code directly relevant understanding neural circuits perform estimation provide estimate format sensory representations cortex cells estimating orientation noisy responses orientation tuned cells unlike provide scalar esti mate neurons retain orientation coarse code format demonstrated fact cells broadly tuned orientation neurons theory estimation biological networks critical characteristics preserve estimate coarse code efficient variance close bound explore paper network architectures performing estimations coarse code lateral connections start briefly describing classical estimators linear nonlinear recurrent networks compare performances classical estimators classical methods simplest estimators linear form performance obtained center mass estimator case periodic variable direction motion method complex estimator comp estimator consists fitting cosine pattern activity shown figure phase statistically efficient estimations cortical lateral connections activity time preferred direction figure circular network units connections originating unit shown activity time nonlinear network initialized random pattern activity units plotted function position circle equivalent preferred direction motion choice weights cosine estimate direction method suboptimal data generated cosine tuning functions case illustrated figure obtain optimum performance fitting curve generate data actual tuning curves units maximum likelihood estimate defined direction maximizing involves type curve fitting process illustrated figure estimate computed finding expected hill hill obtained noise free system minimizing distance data case gaussian noise distance measure minimize euclidian squared distance final position peak hill corresponds maximum likelihood estimate recurrent networks circular network units fully connected depicted figure choice weights activation function network develop pattern activity response 
transient input illustrated figure initialize networks activity patterns responses direction tuned units figure final position hill neuronal array relaxation estimate direction variance estimator depend exact choice activation function weights linear network network units dynamics governed difference equation dynamics networks understood unit receives weight vector weight matrix symmetric case pouget zhang network dynamics suppresses fourier component initial input pattern independently factors equal component fourier transform component resp fourier component initial pattern activity amplified resp suppressed choose network selectively fourier component data suppressing network unstable stop large fixed number iterations activity pattern cosine function direction phase phase fourier components data words network fitting cosine function data equivalent comp method network orientation selectivity proposed closely related linear network method estimate coarse code format suffers problems unclear extended periodic variables disparity suboptimal equivalent comp estimator nonlinear network network units fully connected dynamics governed difference equations demonstrated symmetric weights network develops stable hill response transient input shape hill fully weights function final position hill depends initial input network fits expected function present simulations investigated network place hill methods simulations consisted estimating direction moving based activity input units tuning direction corrupted noise circular normal functions showed figure model activities corresponds spontaneous activity unit peak circular normal functions uniformly spread interval activities depended noise distribution types noise poisson distributed distributed fixed results compare deviation nonlinear network inputs patterns shown iteration case statistically efficient estimations cortical lateral connections noise normal distribution comp noise poisson distribution comp figure histogram 
standard deviations estimate methods bound compute standard deviation seung sompolinsky weights recurrent network chosen final pattern activity network profile similar tuning function results preferred direction consecutive units network estimates exhibit bias difference estimate true direction directions peaks consecutive units simulations showed significant bias orientations tested shown compared standard deviations estimates methods types noise method found outperform comp estimators cases match bound gaussian noise figure suggested analysis noise poisson distribution standard deviation figure estimated derivative estimate respect initial activity units orientation derivative case matches closely derivative cell tuning curve words units contribute estimate amplitude derivative tuning curve shown figure true matches closely derivative units tuning curves contrast derivatives comp estimate dotted line estimate line match profile units preferred direction units activity noise contributing final estimate performance estimator looked standard deviation function time number iterations reaching stable state hundred iterations make method slow practical purpose found standard deviation decreases rapidly iterations reaches asymptotic values iterations figure wait perfectly stable pattern activity obtain minimum standard deviation analysis determine factors control final position hill find function called lyapunov function minimized time network dynamics cohen grossberg shown network characterized dynamical equation input pattern pouget zhang comp preferred direction time iterations figure comparison solid line comp functions normalized standard deviation function number iterations clamped minimizes lyapunov function form term product input pattern current activity pattern neuronal array simply scaling factor input pattern dynamics network tend minimize equivalently maximize overlap stable pattern input pattern terms dependent shape final stable activity profile depends 
input pattern network settle compromise maximizing overlap profile clamped input show small input scaling factor dominant term lyapunov function product taylor expansion lyapunov function respect denote profile stable activity limit input write lyapunov function input keeping firstorder terms taylor expansion means product order term disturbances shape final activity profile contribute higher order terms negligible small notice limit input shape activity profile fixed thing unknown peak position constant global minimum lyapunov function correspond peak position maximizes product difference negligible sufficiently small input definition small input network converge solution maximizing primarily mathematically equivalent minimizing square distance input output pattern activity pattern input network stable hill peak position close direction corre statistically estimations cortical lateral connections sponding maximum likelihood estimate assumption gaussian noise provided network attracted local minimum function result valid small clamped input simulations show transient input sufficient reach bound discussion results demonstrate perform efficient unbiased estimation coarse coding neurally plausible architecture model relies lateral connections implement prior expectation profile activity patterns consequence units determine activation input activity neighbors approach shows advantages coarse code provide representation simplifies problem cleaning uncorrelated noise neuronal population unlike comp estimate result voting process units vote preferred direction units turn contribute derivatives tuning curves case feature network ignore background noise responses factors variable interest property predicts discrimination directions vertical affected units tuned prediction consistent psychophysical experiments showing discrimination vertical human affected prior adaptation orientations displaced vertical approach readily extended periodic sensory motor vari ables periodic 
variables disparity line image network adapted relies circular symmetrical weights simply network sufficient deal values center interval consideration work needed deal boundary values generalize approach arbitrary mapping coarse codes variables function coarse code radial basis functions subsequently approximate arbitrary functions similar approach mappings common situation vision robotics adapting network simultaneously references sompolinsky proc natl acad cohen grossberg ieee trans hirsch differential equations dynamical systems linear algebra academic press york seung sompolinsky proc natl acad
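The two classical readouts compared in the paper above (the cosine fit, COMP, and the maximum likelihood template fit under Gaussian noise) can be sketched in a few lines. This is only an illustration under stated assumptions: the circular normal tuning parameters (`amplitude`, `width`, `baseline`), the 64-unit population, the grid search, and all function names are hypothetical choices, not values recovered from the paper.

```python
import math
import random


def tuning(theta, preferred, amplitude=3.0, width=2.0, baseline=0.3):
    """Circular normal tuning curve; the parameter values are illustrative guesses."""
    return amplitude * math.exp(width * (math.cos(theta - preferred) - 1.0)) + baseline


def noisy_population(theta, preferred_dirs, noise_sd, rng):
    """Mean response of every unit plus independent Gaussian noise."""
    return [tuning(theta, p) + rng.gauss(0.0, noise_sd) for p in preferred_dirs]


def comp_estimate(activity, preferred_dirs):
    """Phase of the first Fourier component of the activity (cosine fit)."""
    x = sum(a * math.cos(p) for a, p in zip(activity, preferred_dirs))
    y = sum(a * math.sin(p) for a, p in zip(activity, preferred_dirs))
    return math.atan2(y, x)


def ml_estimate(activity, preferred_dirs, grid_size=720):
    """Least-squares fit of the noise-free hill, searched over a grid of directions.

    With Gaussian noise, minimising the squared distance is the maximum
    likelihood estimate; the grid search is a simplification for this sketch.
    """
    best_theta, best_cost = 0.0, math.inf
    for k in range(grid_size):
        theta = -math.pi + 2.0 * math.pi * k / grid_size
        cost = sum((a - tuning(theta, p)) ** 2 for a, p in zip(activity, preferred_dirs))
        if cost < best_cost:
            best_theta, best_cost = theta, cost
    return best_theta


def circular_difference(a, b):
    """Signed angular difference a - b wrapped to the interval [-pi, pi]."""
    return math.atan2(math.sin(a - b), math.cos(a - b))


def estimator_spread(estimator, theta, preferred_dirs, noise_sd, trials, seed=0):
    """Root-mean-square angular error of an estimator over repeated noisy trials."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        activity = noisy_population(theta, preferred_dirs, noise_sd, rng)
        total += circular_difference(estimator(activity, preferred_dirs), theta) ** 2
    return math.sqrt(total / trials)


if __name__ == "__main__":
    dirs = [-math.pi + 2.0 * math.pi * i / 64 for i in range(64)]
    for name, est in (("COMP", comp_estimate), ("ML", ml_estimate)):
        print(name, estimator_spread(est, 0.5, dirs, noise_sd=0.5, trials=200))
```

Running the script prints one spread value per estimator; how closely either approaches the theoretical bound depends on the assumed tuning width and noise level, so the numbers should not be read as a reproduction of the paper's figures.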
5 learning spatiotemporal planning dynamic programming teacher feedforward moving obstacle avoidance gerald department university bonn bonn germany department university bonn bonn germany abstract simple testbed application feedforward shortterm planning robot trajectories dynamic environment studied action network embedded sensory system architecture separate world model continuously shortterm predicted spatiotemporal obstacle trajectories receives robot state feedback tion external switching alternative planning tasks generates motor actions subject robots kinematic dynamic constraints collisions moving obstacles avoided supervised learning distribute examples optimal planner mapping adapted parsimonious higher order network training database generated dynamic programming algorithm extensive simulations reveal local planner mapping highly nonlinear effectively sparsely represented chosen powerful model excellent generalization occurs unseen obstacle configurations discuss limitations feedforward growing planning introduction global planning goal directed trajectories subject spatiotemporal statedependent constraints path planning problem considered difficult task suited systems embedded sequential behavior theoretical insights related prob unbounded order minsky practical situations lack globally constraints planning time partially environments question arises extent effective local planning paper problems credit assignment world model fication focus complexity representing local version generic path planning problem feedforward investigate capacity sparse distributed planner representations generalize plans environment robot models environment world robot twodimensional scene occupied obstacles parallel yaxis randomly discretized continuous velocity spectrum environment state list position velocity obstacle environment dynamics obstacles inserted random positions random velocities region distant robots workspace
time step obstacles posi tions updated cross robots workspace time robot robot unit mass confined move interval xaxis state denoted time step motor command applied robot robot dynamics notice admissible motor commands depends present robot state settings robot faces fluctuating number obstacles crossing baseline similar situation cross street dynamic obstacles robot figure obstacles crossing robots workspace system architecture functionality adequate modeling cycle importance design intelligent reactive systems partition system modules active perception module builtin capabilities shortterm environment subsequent action module motor command generation figure module represented classical algorithm neural sensory data stream observed internal representation long term goal state action module motor figure system architecture dynamic scene timevarying obstacle positions learning spariotemporal planning dynamic programming teacher temporal internal representation obstacle trajectories time step incidence function safety margin accounting obstacles incidence function defined spatiotemporal cell array based actual position horizon opening angle region robots speed limit cell time step cells potentially reached robot local horizon represented figure functionality current representation figure spacetime representation solution path robot motor command taking account present robot state regard longterm goal firstly realize optimal dynamic programming algorithm bellman supervised learning distribute optimal planning examples neural network dynamic programming solution internal representation time present robot state specification desired longterm goal determines sequence motor commands minimizing cost functional horizon dynamics solution path figure denote desired robot position longterm goal deviations position higher costs costly obstacle collisions excluded restricting search admissible cells obeying training targets time optimal present motor actions minimum attained 
cases optimal solutions consistently break symmetry order obtain deterministic target mapping neural action model neural motor command generation single layer parsimonious higher order neurons computing outputs target values single neuron optimal input neuron receives components values incidence function binary encoded robot state task bits encoding longterm goal batch training maximize likelihood criterion neuron independently recall motor command obtained decision index active neuron yields motor action applied generally atoms nonlinear interactions input form understood exponent complete forms basis boolean functions expansions combinatorial growth number terms increasing input dimension renders allocation complete basis impractical case action model employing excessive numbers basis functions overfit data preventing generalization structural adaptation algorithm discussed detail automatic identification inclusion sparse relevant nonlinearities present problem effect algorithm performs guided stochastic search exploring space nonlinear interactions means process weight adaptation competition nonlinear terms model restricts number terms orders exponential size small subset terms parsimonious higher order function expansion denotes usual sigmoid transfer function high degrees effectively trained emerged robust generalization difficult nonlinear classification benchmarks learning spariotemporal planning dynamic programming teacher simulation results performed extensive simulations evaluate neural action networks ities generalize learned optimal planning examples planner trained respect alternative longterm goals firstly optimal planner actions time steps simulated fairly moving obstacles longterm goals time step optimal motor commands computed robot states situations excluded path planning horizon considered horizon total admissible training situations left generated full spectrum robot states checked time step states average findings difficulty task repetitions 
present accumulated patterns reflecting statistics simulated environment original training repeated patterns providing learner information pattern working data base patterns left input neural action consisted length bits encode internal representation cone size figure bits encode robots state single task reports desired goal train single neuron learning maximum epochs cases sufficient successful training classification neurons training patterns misclassified individual motor neurons additional robustness decision recall voting community test generalization neural action model size training figure generalization behavior data base parts training patterns test patterns present training runs performed sizes terms results varying training sizes depicted figure test error decreases increasing training size falls percent training patterns continues decrease larger training sets findings trained architectures emerge robust generalization insight complexity mapping counted number terms carry order resulting distribution maximum order exhibits terms orders higher finally decreases orders exceeding figure planner mapping considered highly nonlinear averaged networks order figure distribution orders discussion conclusions sparse representation planner mappings desirable representation plete policy lookup tables curse dimensional computation plans expensive conflicting realtime requirements reasons investigate capacity trol effective distributed representation robust generalization planner mappings focused type shallow feedforward action network local trajectory planning problem advantage feed forward nets recall important requirement systems acting rapidly changing environments theoretical considerations concern related problem inherent serial character minsky planning problem focus expected hard feedforward nets local planning complex nonlinear planner learning spatiotemporal planning dynamic programming teacher expected powerful neuron model identifies relevant 
nonlinearities inherent problem determined extremely architectures representation planner mapping compact important features determines optimal plan adapted networks emerged excellent generalization encourage nets difficult local planning tasks care models support effective representation highorder nonlinearities growing planning expected feedforward werbos simple testbed presented insertion testing models system designs including recurrent networks acknowledgement work supported ministry research technology project grant references baum supervised learning probability distributions neural networks anderson neural information processing systems denver american institute physics bellman dynamic programming princeton university press nearoptimal planning robots coupled dynamic bounds proc ieee conf robotics automation structural adaptation boolean higher order neurons superior classification parsimonious topologies proc structural adaptation parsimonious higher order classifiers neural networks finite orthogonal series design digital devices york john wiley sons minsky papert perceptrons cambridge press werbos approximate dynamic programming realtime control neural modeling white handbook intelligent control york
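The dynamic programming teacher described in the paper above searches a spatiotemporal cell array for the cheapest collision-free path toward a long-term goal position. A minimal sketch follows, with several simplifying assumptions that are not from the paper: the robot is purely kinematic (it moves by -1, 0, or +1 cells per time step, with no velocity state), the cost is the summed distance to the goal, and the names `plan`, `occupied`, `start`, and `goal` are hypothetical. Ties are broken by the fixed order of `moves`, which mirrors the paper's remark that symmetric optima must be resolved consistently to obtain a deterministic target mapping.

```python
import math


def plan(occupied, start, goal, moves=(-1, 0, 1)):
    """Backward dynamic programming over a space-time occupancy grid.

    occupied[t][x] is True when cell x is blocked by an obstacle at time t.
    Returns (total_cost, positions); positions is None if no safe path exists.
    """
    horizon = len(occupied)
    width = len(occupied[0])
    cost_to_go = [[math.inf] * width for _ in range(horizon)]
    best_move = [[None] * width for _ in range(horizon)]

    # Terminal costs: distance to the goal for every free cell at the horizon.
    for x in range(width):
        if not occupied[horizon - 1][x]:
            cost_to_go[horizon - 1][x] = abs(x - goal)

    # Sweep backwards in time, considering only admissible (free) cells.
    for t in range(horizon - 2, -1, -1):
        for x in range(width):
            if occupied[t][x]:
                continue
            for m in moves:
                nx = x + m
                if 0 <= nx < width and cost_to_go[t + 1][nx] < math.inf:
                    cost = abs(x - goal) + cost_to_go[t + 1][nx]
                    # Strict comparison: the first move in `moves` wins ties.
                    if cost < cost_to_go[t][x]:
                        cost_to_go[t][x] = cost
                        best_move[t][x] = m

    if cost_to_go[0][start] == math.inf:
        return math.inf, None

    # Roll the stored decisions forward to recover the solution path.
    path = [start]
    x = start
    for t in range(horizon - 1):
        x += best_move[t][x]
        path.append(x)
    return cost_to_go[0][start], path


if __name__ == "__main__":
    grid = [[False] * 5 for _ in range(4)]
    grid[1][1] = True  # an obstacle occupies cell 1 at time step 1
    print(plan(grid, start=0, goal=2))
```

In the supervised setting of the paper, only the first move of each solved situation would serve as the training target; the full path is returned here to make the sketch easy to inspect.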
6 hodgkinhuxley type neuron model learns slow oscillation doya allen selverston department biology university california diego jolla peter abstract gradient descent algorithm parameter estimation similar continuoustime recurrent neural networks derived hodgkinhuxley type neuron models brane potential trajectories targets parameters maximal conductances thresholds slopes activation curves time successfully estimated algorithm applied modeling slow oscillation identified neuron lobster stomatogastric ganglion model ionic currents trained experimental data revealed role slow oscillation introduction neuron models formulated hodgkin huxley commonly describing biophysical mechanisms underlying neuronal havior days hodgkin huxley tens ionic channels identified recent type models tens variables hundreds parameters ideally parameters type models deter mined experiments individual ionic currents experiments difficult impossible carry parameters computer simulations model behavior resembles real neuron manual search high dimensional current address salk institute diego hodgkinhuxley type neuron model learns slow oscillation figure view neuron model parameter space unreliable good match found tween model real neuron validity parameters questionable general settings lead apparently behavior propose automatic parameter tuning algorithm type neuron models type model network sigmoid functions multipliers leaky figure tune parameters manner similar tuning connection weights continuoustime neural network models training model initial parameter points match experimental data systematically estimate region parameter space single point test parameters spiking neuron model identified membrane potential trajectories apply learning algorithm model slow oscillation identified neuron lobster stomatogastric ganglion resulting model suggests role slow oscillation membrane potential range standard form ionic currents forms voltage dependency curves repre kinetics ionic channels order derive 
simple efficient learning algorithm chose unified form voltage dependency curves based statistical physics ionic channels ionic currents model dynamics membrane potential membrane capacitance externally injected current ionic current product maximum conductance activation variable doya selverston inactivation variable difference membrane potential reversal potential exponents represent gating elements ionic channels integer variables assumed obey order differential equation steady states sigmoid functions membrane potential represent threshold slope steady state curve rate coefficients voltage dependence time constant error gradient calculus goal minimize average error cycle period target membrane potential trajectory evaluated variation equation place training effect small change parameter dynamical system ndimensional linear system timevarying coefficients general variation calculus requires parameter case model teacher forcing reduces order linear system effect small change maximum conductance membrane potential estimated derive gradient respect model parameters studies recurrent neural networks shown teacher forcing important training autonomous oscillation patterns type models teacher forcing drives activation inactivation variables target membrane potential hodgkinhuxley type neuron model learns slow oscillation total membrane conductance effect activation threshold estimated equations solution represents perturbation time error gradient similarly parameter update basically arbitrary gradientbased optimization algorithms simple gradient descent conjugate gradient descent algorithm continuoustime version gradient descent normalized parameters parameters type model physical dimensions magnitudes perform simple gradient descent represent parameter default deviation perform gradient descent normalized parameters updating parameters running model integrating error gradient updated parameters online running average gradient averaging time learning rate online 
scheme susceptible parameter oscillation batch update scheme larger learning rates parameter estimation spiking model tested model random initial parameters estimate rameters model training membrane potential trajectories default parameters model match original model table membrane potential trajectories levels current injection alternately target trials initializing randomly cases error cycles training figure exam oscillation patterns trained model normalized doya selverston table parameters spiking neuron model subscripts leak sodium potassium currents constants default msec msec learning time figure trajectory spiking neuron membrane potential activation inactivation variables dotted line shows target trajectory covariance matrix normalized parameters learning black white squares represent negative positive covariances hodgkinhuxley type neuron model learns slow oscillation table parameters cell model constants tuned msec msec msec wall figure oscillation pattern cell model membrane potential activation inactivation variables ionic currents parameters table implies original parame values successfully estimated learning standard deviation parameter critical setting replicate oscillation terns covariance matrix parameters figure estimate distribution solution points parameter space modeling slow oscillation applied algorithm experimental data cell stomatogastric ganglion isolated cell oscillates acetylcholine sodium channel oscillation period seconds membrane potential approximately data stomatogastric rons assumed potassium current inactivation slow current principal active currents voltage range default parameters currents table doya selverston ionic currents figure curves cell model outward current positive figure model behavior learning cycles actual output model shown solid curve close target output shown dotted curve bottom traces show ionic currents underlying slow oscillation figure shows steady state curves currents negative conductance range resulting 
positive feedback membrane potential quiescent state rotate diagram degrees similar diagram model faster outward model takes role fast sodium current model slower takes role outward potassium current discussion results gradient descent algorithm effective estimating parameters type neuron models membrane potential trajectories recently automatic parameter search algorithm proposed bower chose maximal conductances free parameters conjugate gradient descent error gradient estimated slightly changing parameters approach error gradient efficiently utilizing variation equations teacher forcing parameter normalization essential gradient descent work order neuron oscillator required fast feedback mechanism balanced slower negative feedback mechanism popular positive feedback sodium current negative feedback potassium current model common calcium current calcium dependent ward potassium current found combination positive negative feedback algorithm inactivation outward activation slow hodgkinhuxley type neuron model learns slow oscillation acknowledgement authors providing physiological data stomatogastric cells study supported part grant references bower exploring parameter space detailed single neuron models simulations granule cells olfactory bulb journal neurophysiology marder mathematical model identified stomatogastric ganglion neuron journal neurophysiology connor walter neural repetitive firing cations hodgkinhuxley axon suggested experimental results axons biophysical journal doya bifurcations learning recurrent neural networks proceed ings ieee international symposium circuits systems pages diego doya selverston learning algorithm hodgkinhuxley type neuron models proceedings pages japan doya yoshizawa adaptive neural oscillator continuoustime backpropagation learning neural networks selverston mechanisms gastric rhythm generation isolated stomatogastric ganglion bursting potential synaptic interactions modulation journal marder ionic currents lateral pyloric 
neuron stomatogastric ganglion journal neurophysiology ionic channels excitable membranes hodgkin huxley quantitative description membrane currents application conduction excitation nerve journal physiology mechanism channel gating excitable annals york academy sciences selverston learning algorithms oscillatory networks junctions membrane currents network williams zipser gradient based learning algorithms recurrent connectionist networks technical report college computer science university
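The model structure described in the paper above (an ionic current as the product of a maximal conductance, gating variables raised to integer exponents, and a driving force, with sigmoid steady states set by a threshold and a slope) can be sketched as follows. The exact functional forms were lost in extraction, so this uses standard textbook forms as an assumption; the constant time constant `tau`, the forward Euler step, and all function and parameter names are hypothetical. Under teacher forcing, the gating variables are driven by the target membrane potential rather than the model's own output, which is what `teacher_forced_gate` imitates.

```python
import math


def steady_state(v, threshold, slope):
    """Sigmoid steady-state curve; equals 0.5 at the threshold potential."""
    return 1.0 / (1.0 + math.exp(-slope * (v - threshold)))


def ionic_current(g_max, activation, inactivation, v, reversal, p=1, q=0):
    """I = g_max * a^p * b^q * (V - E); the sign convention is an assumption."""
    return g_max * activation ** p * inactivation ** q * (v - reversal)


def relax_gate(x, v, threshold, slope, tau, dt):
    """One Euler step of the first-order gating equation dx/dt = (x_inf(V) - x) / tau.

    The paper uses a voltage-dependent time constant; a constant tau is a
    simplification made for this sketch.
    """
    return x + dt * (steady_state(v, threshold, slope) - x) / tau


def teacher_forced_gate(target_v, threshold, slope, tau, dt, x0=0.0):
    """Drive a gating variable with a target membrane potential trajectory."""
    trace = []
    x = x0
    for v in target_v:
        x = relax_gate(x, v, threshold, slope, tau, dt)
        trace.append(x)
    return trace


if __name__ == "__main__":
    trace = teacher_forced_gate([-20.0] * 2000, threshold=-40.0, slope=0.2, tau=10.0, dt=0.1)
    print(trace[-1], steady_state(-20.0, -40.0, 0.2))
```

Because the forced gating variables no longer depend on the model's membrane potential, the sensitivity of the potential to each parameter reduces to a low-order linear system, which is the property the paper exploits when computing error gradients.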
8 realizable learning task exhibits overfitting laboratory information representation japan email abstract paper examine perceptron learning task task realizable provided perceptron architecture perceptrons nonlinear sigmoid output functions gain output function determines level nonlinearity learning task observed high level nonlinearity leads overfitting give explanation surprising observation develop method avoid overfitting method interpretations learning noise crossvalidated early stopping learning rules examples property makes feedforward neural nets interesting practical applications ability approximate functions amples feedforward networks hidden layer nonlinear units approximate continuous function ndimensional hypercube arbitrarily existence neural function approximators established lack knowledge practical realizations major problems good realization overfitting understanding work study overfitting onelayer perceptron model model good theoretical description exhibits qualitatively similar behavior multilayer perceptron onelayer perceptron input units output unit input output layer adjustable weights output possibly nonlinear function weighted inputs realizable learning task exhibits overfitting quality function approximation measured difference correct output nets output averaged inputs supervised learning scheme trains network examples correct output learning task minimize cost function measures difference correct output nets output averaged examples squared error suitable measure difference outputs define training error generalization error development errors function number trained examples learning curves training gradient theoretical purposes study learning tasks network socalled teacher network concept transparent definition difficulty learning task monitor training process compare student network teacher network directly suitable quantities comparison perceptron case order parameters transparent interpretation normalized overlap weight vectors 
teacher student norm students weight vector order parameters multilayer learning number increases number permutations hidden units teacher student learning task concentrate case student perceptron learn mapping provided perceptron choose identical networks teacher student sigmoid output function identical network architectures teacher student realizable tasks principle student learn task provided teacher tasks learnt remains finite error distributed random inputs weights weighted assumed gaussian distributed express generalization error order parameters tanh tanh gaussian measure equation student learns gain teachers output function adjusts norm weights gain plays important role tune function linear function highly nonlinear function determine learning curves task emergence overfitting explicit expression weights storage capacity perceptron minimum training error training error implies learnt weights minimal norm fulfill condition hertz note weights completely independent output function simplest realizable case linear perceptron learns linear perceptron statistical mechanics calculation order parameters method statistical mechanics applies commonly replica method details replica approach hertz solution continuous perceptron problem found results statistical calculations exact thermodynamic limit variable natural measure defined fraction number patterns system size thermodynamic limit infinite finite reasonable system sizes theory concentrates temperature limit implies training error accepts absolute minimum number presented examples order parameters case linear perceptron learns linear student temperature limit called exhaustive training student trained absolute minimum reached small high gains levels nonlinearity exhaustive training leads overfitting means generalization error decreasing reason overfitting training strongly examples critical gain determines generalization error increasing decreasing function small values determined linear approximation small
order parameters small students approximated linear function simplifies equation expression function upper bound critical gain reached numerical solution higher slope positive small considerations gain intermediate level nonlinearity figure learning curves problem learns tanh perceptron values gain realizable case exhaustive training lead overfitting gain high understand emergence overfitting evaluation generalization error dependence order parameters helpful shows function exhaustive training realizable cases line independent actual output function means training guided training error generalization error gain higher line starts lower slope results overfitting avoid overfitting guess increases fast compared ratio training process develop description training process training process found order parameters finite temperatures statistical mechanics approach good description training process unrealizable learning task finite temperature order parameters task task linear perceptron learns linear perceptron temperature dependent variable figure contour plot defined generalization error function order parameters starting minimum contour lines dotted lines dashed line corresponds solid lines parametric curves order parameters training strategies straight line illustrates exhaustive training lower optimal training explained gain temperature limit corresponds show decrease temperature dependent parameter describes evolution order parameters training process training process natural parameter number parallel training steps parallel training step patterns presented weights updated shows evolution order parameters parametric curves exhaustive learning curve defined parameter solid line training ends curve dotted lines illustrate training process runs infinity simulations training process shown theoretical curve good description training steps description training process definition optimized training strategy
optimal temperature optimized training strategy chooses temperature temperature minimizes generalization error lower solid curve indicating parametric curve chosen minimizes function minima solid line absolute minimum parametric curves local minima double dashed lines note optimized related optimized temperature equation parameter related number training steps realizable learning task exhibits overfitting local local simulation figure training process order parameters parametric curves parameters straight solid line corresponds exhaustive learning marks dotted lines describe training process fixed iterative training reduces parameter examples lower solid line optimized learning curve achieve curve chosen minimizes absolutely error minima lines local minimum compare absolute local minimum naive early stopping procedure ends minimum smaller minimum training process simulation early stopping earlier stopping training process avoid overfitting order determine stopping point actual generalization error training crossvalidation provide approximation real generalization error crossvalidation error defined examples training calculate optimum real generalization error determine optimal point early stopping lower bound training finite crossvalidation sets preliminary tests shown small crossvalidation sets approximate real training stopped increases resulting curve standard deviation simulation averaged trials results shown learning curves early stopping strategy avoids overfitting summary paper shown overfitting emerge realizable learning tasks calculation critical gain contour lines imply local local simulation figure learning curves parametric curves upper solid line shows exhaustive training optimized finite temperature curve lower solid line exhaustive optimal training lead identical results marks simulation early stopping finds minimum reason overfitting nonlinearity problem network adjusts slowly nonlinearity task developed method avoid overfitting interpreted ways 
training finite temperature reduces overfitting realized trains noisy examples interpretation learns noise stops training earlier early stopping guided crossvalidation observed early stopping completely simple lead local minimum generalization error aware possibility applies early stopping multilayer perceptrons built nonlinear perceptrons effects important multilayer learning study large scale simulations müller shown overfitting occurs realizable multilayer learning tasks acknowledgments amari opper stimulating discussions hints presentation references avoiding overfitting finite temperature learning cross validation international conference artificial neural opper generalization ability perceptrons continuous outputs phys hertz krogh palmer introduction theory neural computation reading addisonwesley müller murata schulten amari large scale simulations learning curves neural computation press
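The early stopping procedure discussed in the paper above (stop training where the crossvalidation error is lowest, while being aware that a naive stop at the first minimum can land in a local minimum) can be sketched as follows. This is a minimal illustration, not the authors' simulation code: the batch gradient update, the function names, the learning rate, the `patience` window used to look past a first local minimum, and the data sizes are all assumptions made for the example.

```python
import math
import random


def output(w, x, gain):
    """Tanh perceptron output for weights w, input x and a given gain."""
    return math.tanh(gain * sum(wi * xi for wi, xi in zip(w, x)))


def error(w, data, gain):
    """Mean quadratic error on a list of (input, target) pairs."""
    return sum((output(w, x, gain) - y) ** 2 for x, y in data) / (2 * len(data))


def train_early_stopping(train, valid, gain, lr=0.5, max_steps=300, patience=30):
    """Batch gradient descent that remembers the weights with the lowest
    crossvalidation error (hypothetical helper, not from the paper)."""
    dim = len(train[0][0])
    w = [0.0] * dim
    best_w, best_err, best_step = list(w), error(w, valid, gain), 0
    for step in range(1, max_steps + 1):
        grad = [0.0] * dim
        for x, y in train:
            out = output(w, x, gain)
            delta = (out - y) * gain * (1.0 - out * out)
            for i in range(dim):
                grad[i] += delta * x[i] / len(train)
        w = [wi - lr * gi for wi, gi in zip(w, grad)]
        err = error(w, valid, gain)
        if err < best_err:
            best_w, best_err, best_step = list(w), err, step
        elif step - best_step >= patience:
            # No improvement for `patience` steps: stop. Waiting a while
            # guards against stopping at the first shallow local minimum.
            break
    return best_w, best_err, best_step


def make_data(n, teacher, gain, rng):
    """Examples labelled by a teacher tanh perceptron (realizable task)."""
    data = []
    for _ in range(n):
        x = [rng.gauss(0.0, 1.0) for _ in teacher]
        data.append((x, output(teacher, x, gain)))
    return data


rng = random.Random(0)
teacher = [rng.gauss(0.0, 1.0) for _ in range(10)]
train_set = make_data(15, teacher, 5.0, rng)
valid_set = make_data(5, teacher, 5.0, rng)
best_w, best_err, best_step = train_early_stopping(train_set, valid_set, gain=5.0)
```

The returned `best_step` is the stopping point selected by the crossvalidation set; with a small crossvalidation set it only approximates the minimum of the true generalization error, as the paper notes.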
10 hierarchical nonlinear factor analysis topographic maps zoubin ghahramani geoffrey hinton dept computer science university toronto toronto ontario canada abstract describe hierarchical generative model viewed nonlinear generalisation factor analysis implemented neural network model performs inference probabilistically consistent manner topdown bottomup lateral connections connections learned simple rules require locally available information show incorporate lateral connections generative model model extracts sparse distributed hierarchical representation depth simplified randomdot stereograms disparity detectors hidden layer form topographic presented image patches natural scenes model develops graphically local feature detectors introduction factor analysis probabilistic model realvalued data assumes data linear combination realvalued uncorrelated gaussian sources factors linear combination component data vector assumed corrupted additional gaussian noise major advantage generative model data vector probability distribution space factors multivariate gaussian linear function data tractable compute posterior distribution learning parameters model linear combination matrix noise variances major disadvantage factor analysis linear model insensitive higher order statistical structure observed data vectors make factor analysis nonlinear mixture factor modules captures linear regime data view factors modules large basis functions describing data process selecting module corresponds selecting subset basis functions number subsets consideration linear number modules tractable compute full posterior distribution data point mixture model inadequate typical image multiple objects represent pose deformation object representation objects parameters obtained factor represent multiple objects representations pure mixture idea powerful nonlinear generalisation factor analysis large factors subset factors
selected achieved generative model high probability generating factor activations rectified gaussian belief nets rectified gaussian belief multiple layers units states positive real values main disadvantage posterior distribution factors data vector involves gibbs sampling general gibbs sampling time consuming practice samples unit proved adequate theoretical reasons learning work gibbs sampling fails reach equilibrium describe neural plausibility show lateral interactions layer perform probabilistic inference correctly locally information makes plausible neural model sigmoid belief means gibbs sampling performed requiring units layer total topdown input units layer generative model consists multiple layers units realvalued state rectified state negative equal rectification nonlinearity network gaussian distributed standard deviation determined generative bias combined effects rectified states units layer rectified state gaussian distribution mass gaussian falls concentrated infinitely dense spike shown infinite density creates problems attempt gibbs sampling rectified states suggestion neal perform gibbs sampling states unit intermediate layer multilayer suppose states units perform gibbs sampling stochastically select distribution states units terms energy functions equal negative probabilities constant rectified states units layer contribute quadratic energy term determining states units layer contribute constant positive contribute quadratic term arguments presented paper hold general nonlinear belief networks long noise gaussian specific rectification nonlinearity topdown figure probability density mass gaussian replaced infinitely dense spike schematic density units rectified state bottom topdown energy functions effect index units layer including terms depend omitted values quadratic energy function leads gaussian distribution true values quadratic gaussian distributions agree distribution piecewise gaussian perform gibbs sampling samples
posterior generative weights learned online delta rule maximise probability data variance local gaussian noise unit learned online rule alternatively fixed hidden units effective local noise level controlled scaling generative weights role lateral connections perceptual inference layered belief networks fixing unit layer correlations parents unit layer main reasons purely bottomup approaches perceptual inference proven inadequate learning layered belief networks fail account phenomenon explaining seung introduced lateral connections handle explaining effects perceptual inference network shown contribution energy state network squared difference states units layer topdown expectations generated states units layer assuming local noise models lower layer units unit variance gibbs sampling long reach equilibrium delta rule gradient penalized probability data penalty term divergence equilibrium distribution distribution produced gibbs sampling things equal delta rule adjusts parameters determine equilibrium distribution reduce penalty models gibbs sampling works quickly ignoring biases constant terms unaffected states units expression setting energy function implemented network recognition weights symmetric lateral interactions lateral recognition connections unit compute layer depends state follow gradient perform gibbs sampling figure small segment network showing generative weights dashed recognition lateral weights solid implement perceptual inference correctly handle explaining effects trick eliminates neurally sible aspect model unit layer appears send state topdown prediction state units layer lateral connections units layer effect compute topdown predictions computer simulations simply lateral connection rail product learn lateral connections biologically plausible driving units layer independent gaussian noise simple antihebbian learning rule similarly purely local learning rule learn recognition weights
equal generative weights units layer driven independent gaussian noise turn drive units layer generative weights hebbian learning layers learn correct recognition weights lateral connections generative model generative model topdown connections lateral connections make perceptual inference locally information desirable lateral connections generative model connections nearby units layer priori correlated activities turn lead formation redundant codes topographic maps symmetric lateral interactions states units layer effect adding quadratic term energy function corresponds gaussian markov random field sampling term simply added topdown energy contribution learning difficult difficulty stems derivatives partition function data vector partition function depends topdown inputs layer varies data vector lateral connections nonadaptive fortunately topdown prediction define gaussians states units layer derivatives easily calculated assuming unit variances matrix layer including units identity matrix term delta rule term derivative partition function involves matrix inversion partition function multivariate gaussian analytical learn lateral connections lateral interactions rectified states units quadratic term partition function longer analytical comput gradient likelihood involves procedure averages respect posterior distribution averages respect posterior distribution prior units layer learning rule suffers problems boltzmann machine slow requires approximation results familiar delta rule equivalent ways treats lateral connections generative model additional lateral connections recognition model lateral connections generative model assumes children clamped values affect inference likelihood learning penalized likelihood model lateral connections generative model discovering depth simplified stereograms generative process stereo pairs random dots uniformly distributed intensities scattered sparsely onedimensional surface image blurred gaussian filter
surface randomly depths giving rise lefttoright disparities images separate gaussian noise added image images generated manner shown figure sample data stereo disparity problem left right column image inputs left periodic boundary conditions pixel represented size square white positive black negative notice pixel noise makes difficult infer disparity vertical shift left right columns images sample images generated model learning trained threelayer consisting visible units units hidden layer unit hidden layer wide stereo disparity problem hidden units hidden layer connected entire array visible units inputs eyes hidden units layer laterally connected units nearby units excited distant units inhibited pattern difference gaussians initialised large weights decayed exponentially training network trained passes data images image iterations gibbs sampling approximate posterior distribution hidden states iteration consisted sampling hidden unit random order states fourth iteration gibbs sampling learning learning rate weight decay parameter level generative process makes discrete decision left global disparity trivial extension level unit saturates figure generative weights trained stereo disparity problem weights layer hidden unit hidden units biases hidden units weights hidden units visible array hidden units learned local detectors local detectors unit hidden layer learned positive weights detectors layer negative weights detectors fact activity unit true global disparity input images accuracy random sample images generated model learning shown addition forming hierarchical distributed representation disparity units hidden layer topographic caused high correlations nearby units early learning turn resulted nearby units learning similar weight vectors emergence topography depended strength speed decayed results insensitive parametric presented image patches natural images network units hidden layer arranged grid network
developed local feature detectors nearby units responding similar features units units clustered area discussion classical models topography formation kohonens elastic thought variations mixture models additional constraints encourage neighboring hidden units similar generative weights problem mixture model handle images things contrast figure generative weights trained natural image patches weights hidden units arranged sheet boundary conditions shown topography arise richer hierarchical generative models inducing correlations neighboring units sense topography consequence lateral connection trick perceptual inference infeasible interconnect pairs units cortical area assume direct lateral interactions interactions mediated interneurons primarily local widely separated units apparatus required explaining computation posterior distribution incorrect generative weight vectors widely separated units orthogonal generative weights constrained positive vectors orthogonal zeros hidden units model typically spatially widely separated units attend parts image units attend overlapping patches laterally interconnected lateral connections generative model assist formation topography required correct perceptual inference acknowledgements dayan frey goodhill mackay neal revow research funded nserc fellow references bell sejnowski independent components natural scenes edge filters vision research press durbin willshaw analogue approach travelling salesman problem elastic method nature ghahramani hinton algorithm mixtures factor analyzers univ toronto technical report goodhill willshaw application algorithm formation ocular dominance stripes network comp hinton ghahramani generative models discovering sparse distributed representations trans kohonen selforganized formation topologically correct feature maps cybernetics seung unsupervised learning convex conic coding mozer jordan petsche nips press cambridge lewicki sejnowski bayesian unsupervised learning higher
order structure nips press cambridge neal connectionist learning belief networks intell neal hinton view algorithm justifies incremental variants unpublished manuscript
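The generative pass of the rectified gaussian belief net described in the paper above (each unit has a real-valued state that is gaussian distributed around its generative bias plus the weighted rectified states of the layer above, and a rectified state that is zero when the real-valued state is negative) can be sketched as top-down ancestral sampling. This is an illustrative sketch only: the data layout, the function names and the example weights are assumptions, and inference by Gibbs sampling and the lateral connections are not shown.

```python
import random


def rectify(y):
    """Rectified state: zero if the state is negative, otherwise the state."""
    return y if y > 0.0 else 0.0


def sample_layer(parent_states, weights, biases, sigmas, rng):
    """Sample the real-valued states of one layer given the layer above.

    weights[k][j] is the (assumed) generative weight from parent unit k
    to unit j in this layer; the top layer has no parents.
    """
    rect_parents = [rectify(y) for y in parent_states]
    states = []
    for j, (bias, sigma) in enumerate(zip(biases, sigmas)):
        top_down = sum(rp * weights[k][j] for k, rp in enumerate(rect_parents))
        states.append(rng.gauss(bias + top_down, sigma))
    return states


def sample_network(layers, rng):
    """Ancestral sampling from the top layer down to the visible layer."""
    all_states = []
    parent = []
    for layer in layers:
        parent = sample_layer(parent, layer["weights"], layer["biases"],
                              layer["sigmas"], rng)
        all_states.append(parent)
    return all_states


# Tiny two-layer example with hypothetical parameters.
example_layers = [
    {"weights": [], "biases": [1.0, -2.0], "sigmas": [1.0, 1.0]},
    {"weights": [[0.5, 1.0], [3.0, 3.0]], "biases": [0.0, 0.0],
     "sigmas": [0.1, 0.1]},
]
sample = sample_network(example_layers, random.Random(0))
```

Because a unit whose state is negative passes zero to the layer below, only a subset of the factors contributes to any one data vector, which is the sparse selection of factors the paper argues for.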
6 fast nonlinear dimension reduction todd leen department computer science engineering oregon graduate institute science technology portland abstract present fast algorithm nonlinear dimension reduction algorithm builds local linear model data merging clustering based distortion measure exper speech image data local linear algorithm produces encodings lower distortion built layer autoassociative networks local linear algorithm order magnitude faster train introduction feature sets compact data represent dimension tion compact representations storage transmission classification dimension reduction algorithms operate identifying eliminating statistical data optimal linear technique dimension reduction principal component anal ysis performs dimension reduction projecting original dimensional data dimensional linear subspace spanned leading eigenvectors covariance matrix builds global linear model data dimensional hyperplane sensitive correlations fails detect higherorder statistical expects nonlinear techniques provide performance compact representations lower distortion paper introduces local linear technique nonlinear dimension reduction demonstrate superiority recently proposed global nonlinear technique fast nonlinear dimension reduction show nonlinear algorithms provide performance speech image data global nonlinear dimension reduction researchers cottrell metcalfe layered feedforward autoassociative networks bottleneck middle layer perform dimension reduction autoassociative nets single hidden layer provide lower distortion bourlard recent work shows layer autoassociative networks improve networks hidden layers figure hidden layers nonlinear response referred mapping layers nodes middle representation layer provide encoded signal layers weights produce projection layers weights produce maps chosen complete mapping input output approximate identity training data data requires projection nonlinear achieve good network principal find functions dimensional encoding 
original high dimensional representation figure layer feedforward autoassociative network network perform nonlinear dimension reduction dimensions global coordinates built layer network data distributed surface activations representation layer outputs trace coordinates shown solid lines activities nodes representation layer form global input space figure refer layer autoassociative networks global nonlinear dimension reduction technique leen locally linear dimension reduction layer networks drawbacks slow train prone trapped poor local optima accurately global dimensional coordinates data propose alternative suffer problems algorithm pieces local linear coordinate patches local regions defined partition input space induced vector quantizer orientation local coordinates determined figure section present ways obtain partition describe approach euclidean distance describe distortion measure optimal task local figure local coordinates built algorithm data tributed surface solid lines represent principal voronoi cell region covered voronoi cell shown shaded euclidean partitioning clustering euclidean distance local regions hybrid algorithm proceeds steps competitive learning train euclidean distance reference vectors weights perform local voronoi cell cell compute local covariance matrix data respect responding reference vector centroid compute eigenvectors covariance matrix choose target dimension project data vector leading eigenvectors obtain local linear coordinates fast nonlinear dimension reduction encoding consists index reference cell closest euclidean distance component vector decoding reference vector centroid cell leading eigenvectors covariance matrix cell squared reconstruction error incurred denotes expectation respect defined training performing local fast relative training layer network training time dominated distance computations competitive learning computation significantly architecture gray projection partitioning algorithm optimal clustering 
independently projection goal minimize expected error reconstruction realize expected reconstruction error distortion measure design reconstruction error defined written matrix form matrix rows orthonormal eigenvectors covariance matrix cell squared euclidean distance data local hyperplane expression error suggests distortion measure call reconstruction distance reconstruction distance error incurred approximating local coefficients squared projection difference vector eigenvectors covariance matrix cell clustering respect reconstruction distance directly minimizes expected reconstruction error modified algorithm partition input space reconstruction distance perform local steps algorithm section trained batch mode generalized algorithm gersho gray online competitive learning avoids matrix depends input vector leen experimental results apply layer networks dimension reduction speech images compare algorithms performance criteria training time distortion reconstructed signal distortion measure normalized reconstruction error model construction trained optimization techniques conjugate gradient descent algorithm method press stochastic gradient descent order limit space architectures number nodes mapping fourth layers euclidean distance clustering implemented stan dard quantization multi stage architecture reduces number distance calculations train time gray dimension reduction speech examples twelve vowels extracted continuous speech drawn timit database fisher input vector consists coefficients spanning frequency range time averaged central utterance divided data training vectors validation vectors test vectors validation architecture selection number nodes mapping layers layer nets test utterances speakers represented training validation motivated desire capture formant structure vowel encodings reduced data dimensions experiments reduction dimensions gave similar results reported leen table test reconstruction errors training times encodings significantly lower 
reconstruction error global layer nets slightly lower reconstruction error slow train search trains orders magnitude faster achieves error times great modified algorithm reconstruction distance measure clustering reconstruction error architectures fast nonlinear dimension reduction table speech data test reconstruction errors training times architec tures represented experiments lowest validation error parameter ranges explored numbers parentheses values free parameters algorithm represented network nodes mapping layers clustering voronoi cells algorithm training time seconds table reconstruction errors training times dimension reduction images architectures represented experiments lowest validation error parameter ranges explored algorithm training time seconds dimension reduction images data consists images faces people grayscale image extracted principal components image experimental data data preparation demers cottrell study dimension reduction layer autoassociative nets demers cottrell trained reduce principal components dimensions divided data training images validation architecture selection images test images reduced images dimensions table configuration varying ments algorithm posed memory time requirements task leen table reconstruction errors training times dimension reduction images training data architectures represented experiments lowest error parameter ranges explored algorithm training time seconds summarizes results notice layer obtains encoding error data takes long time train training data improve results figure representative images left original image recon struction encodings comparison demers demers cottrell work conducted experiments training data results summarized table figure shows sample faces nonlinear techniques produce encodings lower error indicating significant nonlinear structure data data nodes mapping layer demers demers cottrell obtains reconstruction error note algorithms achieve order magnitude improvement layer nets terms 
speed training accuracy encodings show results order compare experimental results demers data gave encodings higher error posed memory computational requirements reports half output node corresponds fast nonlinear dimension reduction summary presented local linear algorithm dimension reduction propose distance measure optimal task local results speech image data nonlinear techniques provide accurate encodings local linear algorithm produces accurate encodings simulation image data trains faster layer autoassociative networks acknowledgments work supported grants force office scientific research electric power research institute authors grateful cottrell david demers providing image database experimental results colleagues center spoken language understanding providing speech data references bourlard autoassociation multilayer perceptrons singular decomposition biological cybernetics cottrell metcalfe face emotion gender recog nition holons lippmann john moody touretzky editors advances neural information processing systems pages morgan demers cottrell nonlinear dimensionality reduction giles hanson cowan editors advances neural information processing systems mateo morgan kaufmann fisher darpa speech recognition search database specification status proceedings darpa speech recognition workshop pages palo alto gersho gray vector quantization signal compression kluwer academic publishers gray vector quantization ieee assp magazine pages leen fast nonlinear dimension reduction ieee international conference neural networks pages ieee data compression feature extraction autoassociation feed forward neural networks artificial neural networks pages elsevier science publishers northholland press teukolsky recipes scientific computing cambridge university press york
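The encoding and decoding steps of the local linear algorithm described in the paper above, together with its reconstruction distance, can be sketched as follows. The sketch assumes the reference vectors and the leading orthonormal eigenvectors of each cell's local covariance matrix have already been computed; the dictionary layout, the function names and the two-cell example are hypothetical.

```python
def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def sub(a, b):
    return [ai - bi for ai, bi in zip(a, b)]


def euclidean_distance(x, cell):
    """Squared Euclidean distance from x to the cell's reference vector."""
    diff = sub(x, cell["centroid"])
    return dot(diff, diff)


def reconstruction_distance(x, cell):
    """Squared distance from x to the cell's local hyperplane.

    Equals the squared projection of (x - centroid) onto the discarded
    eigenvectors, computed here as the total squared length minus the part
    captured by the retained (orthonormal) leading eigenvectors.
    """
    diff = sub(x, cell["centroid"])
    kept = sum(dot(e, diff) ** 2 for e in cell["basis"])
    return dot(diff, diff) - kept


def encode(x, cells, distance=euclidean_distance):
    """Return (cell index, local linear coordinates) for input x."""
    index = min(range(len(cells)), key=lambda i: distance(x, cells[i]))
    diff = sub(x, cells[index]["centroid"])
    return index, [dot(e, diff) for e in cells[index]["basis"]]


def decode(index, coords, cells):
    """Reconstruct x from the cell index and its local coordinates."""
    cell = cells[index]
    x_hat = list(cell["centroid"])
    for z, e in zip(coords, cell["basis"]):
        x_hat = [xi + z * ei for xi, ei in zip(x_hat, e)]
    return x_hat


# Two hypothetical cells in 2-D, each keeping one leading eigenvector.
cells = [
    {"centroid": [0.0, 0.0], "basis": [[1.0, 0.0]]},
    {"centroid": [10.0, 0.0], "basis": [[0.0, 1.0]]},
]
```

In this toy setup the point `[6.0, 0.0]` is closer to the second reference vector in Euclidean distance but lies exactly on the first cell's local line, so the two distortion measures assign it to different cells. That is the situation the reconstruction distance is meant to handle.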
8 constructive algorithms hierarchical mixtures experts cambridge university engineering department cambridge england email abstract present additions hierarchical mixture experts architecture applying likelihood splitting criteria expert grow tree adaptively train probable path tree prune branches temporarily redundant demonstrate results growing path pruning algorithms show significant speed efficient parameters standard fixed structure discriminating spirals classifying parity patterns introduction jordan jacobs tree structured network terminal nodes simple function approximators case regression classifiers case classification outputs terminal nodes experts recursively combined root node form output network gates nonterminal nodes clear similarities tree based statistical methods classification regression trees cart breiman friedman olshen stone gate replacing questions asked branch cart analogy application splitting rules build cart start simple tree consisting experts gate partially training simple tree apply splitting criterion terminal node evaluates loglikelihood increase splitting expert experts gate split yields increase loglikelihood added tree process training growing continues desired modelling power reached figure simple mixture experts approach reminiscent cascade correlation fahlman lebiere hidden nodes added multilayer perceptron trained rest network fixed similarities model merging techniques stacked regression wolpert explicit partitions training differs model merging expert considers input space forming output whilst network flexibility gate implicitly partition input space soft manner leads long computation case optimally trained models time paths large network high probability order overcome drawback introduce idea path pruning considers paths root node probability greater threshold classification hierarchical mixtures experts mixture experts shown figure consists experts perform local
function approximation expert outputs combined gate form output hierarchical case experts mixtures experts extending architecture tree structured fashion terminal node expert variety forms depending application case classification expert outputs vector element conditional probability class computed softmax function parameter matrix expert denotes class outputs experts combined gate terminal nodes gate outputs estimates conditional probability selecting nonterminal node input path node root node computed softmax function parameter matrix gate denotes expert waterhouse robinson output probabilistic mixture gate outputs mixture weights expert outputs mixture components probability class straightforward extension model conditional probability selecting expert input correct class order train perform classification maximise likelihood variable correct class exemplar expectation algorithm dempster laird rubin jordan jacobs tree growing standard differs tree based statistical models architecture fixed relaxing constraint tree grow achieve greater flexibility network work cart start simple tree instance experts gate train small number cycles network make candidate splits terminal nodes split involves expert pair experts gate shown figure select eventually split candidate splits define split increase loglikelihood split likelihood generation tree make constraint parameters tree remain fixed param figure making split terminal node eters split candidate split made maximisation simplified dependency increases local likelihoods nodes constrain tree growing process find node gains split constructive algorithms hierarchical mixtures experts figure growing figure shows addition pair experts partially grown tree splitting rule similar form cart splitting criterion maximisation entropy node split equivalent local increase loglikelihood final growing algorithm starts tree generation firstly parameters nodes terminal nodes split experts gate split made posterior probabilities node 
greater small threshold prevents splits made nodes data assigned order break symmetry experts split initialised adding small random noise original expert parameters gate parameters small random weights node evaluate training tree standard method nonterminal node parameters fixed loglikelihood splits parameters split independent splits trained removing train multiple trees separately split evaluated split chosen split splits discarded original tree structure recovered additional winning split shown figure tree generation trained usual present decision split tree fairly straightforward candidate split made training fixed tree number iterations alternative scheme investigated make split loglikelihood fixed tree increased number cycles addition splits rejected local loglikelihood discussed issue overfitting paper number techniques prevent overfitting simple technique cart involves growing large tree successively removing nodes tree performance cross validation reaches optimum alternatively bayesian techniques waterhouse mackay robinson applied waterhouse robinson tree growing simulations algorithm solve parity classification task compared growing algorithm fixed depth binary branches figures enabled growing algorithm significantly speeds computation standard fixed structure final tree shape obtained shown figure showed earlier paper waterhouse robinson problem solved experts gate parity problem solved series classifiers gated parent node intuitively appealing form efficient parameters time evolution loglikelihood time seconds generation final tree structure obtained showing node path root node node evolution generations tree figure growing parity problem growing generations deep binary branching growing path pruning good model data generation process case path pruning clear tree sufficient depth model constructive algorithms hierarchical mixtures experts underlying producing data point expect activation expert tend binary values expert selected time exemplar path 
pruning scheme depicted figure pruning scheme activation node exemplar activation defined product node probabilities path root node current node path node root node node exemplar falls threshold ignore subtree parent node training involves statistics tree evaluation involves setting output subtree addition path scheme activation nodes nent pruning tion node falls small threshold node pruned completely tree subtrees moved node nodes process solely improve computational efficiency paper regularisation method brain techniques denker solla scheme measure node effective number parameters moody root node figure path pruning path pruning simulations figure shows application pruning algorithm task discriminating spirals pruning solution twospirals takes seconds pruning solution achieved seconds problem encountered implementing algorithm updates parameters tree case high pruning thresholds node visited times training pass data form reliable statistics parameter values unreliable lead instability gates saturated avoid saturation simplified version regularisation scheme waterhouse conclusions presented extensions standard architecture pruning branches training evaluation significantly reduce computational requirements applying tree growing greater flexibility results faster training efficient parameters time figure effect pruning spirals classification problem deep binary branching loglikelihood time seconds pruning thresholds experts gates pruning training twospirals task classes crosses circles solution spirals problem references breiman friedman olshen stone classification regression trees denker solla optimal brain damage touretzky advances neural information processing systems kaufmann dempster laird rubin maximum likelihood incomplete data algorithm journal royal statistical society series fahlman lebiere cascadecorrelation learning architecture technical report school computer science carnegie mellon university pittsburgh jordan jacobs hierarchical
mixtures experts algorithm neural computation moody effective number parameters analysis generalization regularization nonlinear learning systems moody hanson lippmann advances neural information processing systems morgan kaufmann mateo california waterhouse robinson classification hierarchical mixtures experts ieee workshop neural networks signal processing waterhouse mackay robinson bayesian methods mixtures experts touretzky hasselmo advances neural information processing systems press wolpert stacked generalization technical report santa institute suite santa
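The forward pass of the hierarchical mixture of experts from the paper above, with path pruning at evaluation time, can be sketched as follows. Experts and gates use the softmax function as the paper states; the class names, the linear parameterisation without bias terms, and the example weights are assumptions made for illustration. A subtree whose path activation (the product of gate probabilities from the root) falls below the threshold is skipped and contributes zero to the output.

```python
import math


def softmax(scores):
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def linear(weights, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


class Expert:
    """Terminal node: softmax classifier over the classes."""

    def __init__(self, weights):
        self.weights = weights
        self.n_classes = len(weights)

    def predict(self, x, activation=1.0, threshold=0.0):
        return softmax(linear(self.weights, x))


class Gate:
    """Nonterminal node: softmax gate mixing the outputs of its children."""

    def __init__(self, weights, children):
        self.weights = weights
        self.children = children
        self.n_classes = children[0].n_classes

    def predict(self, x, activation=1.0, threshold=0.0):
        out = [0.0] * self.n_classes
        gate_probs = softmax(linear(self.weights, x))
        for g, child in zip(gate_probs, self.children):
            child_activation = activation * g
            if child_activation < threshold:
                continue  # path pruning: ignore this subtree
            probs = child.predict(x, child_activation, threshold)
            out = [o + g * p for o, p in zip(out, probs)]
        return out


# Hypothetical two-expert tree on a one-dimensional input.
experts = [Expert([[1.0], [-1.0]]), Expert([[-1.0], [1.0]])]
tree = Gate([[0.0], [0.0]], experts)
full_output = tree.predict([1.0])
```

With a threshold of zero every path is evaluated and the outputs sum to one; with a positive threshold the outputs of pruned paths are missing, so the result is no longer normalised, which is the price paid for the reduced computation.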
12 predictive sequence learning recurrent neocortical circuits computational neurobiology sloan center theoretical neurobiology salk institute jolla sejnowski computational neurobiology howard hughes medical institute salk institute jolla abstract neocortical circuits dominated massive excitatory feedback percent synapses made excitatory cortical neurons excitatory cortical neurons massive current excitation neocortex role cortical compu tation recent neurophysiological experiments shown recurrent neocortical synapses governed temporally metric hebbian learning rule describe rule cortex modify recurrent synapses prediction input sequences goal predict cortical input recent past based previous experience similar input sequences show temporal difference learning rule prediction conjunction dendritic backpropagating action potentials reproduces temporally hebbian plasticity observed physiologically biophysical simulations demonstrate network cortical neurons learn predict stimuli develop direction selective responses consequence learning spacetime response properties model neurons shown similar direction selective cells alert monkey introduction neocortex characterized extensive system recurrent excitatory connections neurons area precise computational function massive current excitation remains unknown previous modeling studies suggested role excitatory feedback feedforward inputs recently shown recurrent excitatory connections cortical neurons modified accord temporally asymmetric hebbian learning rule synapses activated slightly cell fires strengthened activated slightly weak information postsynaptic activity cell conveyed back dendritic locations synapses backpropagating action potentials soma paper explore hypothesis recurrent excitation function prediction generation temporal sequences neocortical circuits show research supported sloan foundation howard hughes medical institute predictive sequence learning recurrent neocortical circuits temporal difference based 
learning rule prediction applied backpropagating action potentials reproduces experimentally observed phenomenon asymmetric plasticity show learning mechanism learn temporal sequences property direction selectivity emerges consequence learning predict moving stimuli spacetime response plots model neurons shown similar direction selective cells alert macaque temporally asymmetric hebbian plasticity temporal difference learning accurately predict input sequences recurrent excitatory connections network adjusted neurons activated time step achieved temporal difference learning rule paradigm synaptic plasticity activated synapse strengthened weakened based difference predictions positive negative minimizes errors prediction ensuring prediction generated neuron synaptic modification closer desired details order hebbian learning cortical neurons interpreted form temporal difference learning model cortical neuron consisting dendrite compartment model based previous study demonstrated ability model reproduce range cortical response properties presence voltage activated sodium channels dendrite allowed backpropagation action potentials soma dendrite study plasticity excitatory postsynaptic potentials elicited time delays respect postsynaptic spiking activation single excitatory synapse located dendrite synaptic currents calculated kinetic model synaptic transmission model parameters fitted recorded currents details synaptic plasticity maximal synaptic conductance amount proportional temporal difference postsynaptic membrane potential time presynaptic activation time delay parameter yield results consistent previous physiological experiments presynaptic input model neuron paired postsynaptic spiking current pulse soma synaptic efficacy monitored applying test stimulus pairing recording epsp evoked test stimulus figure shows results pairings postsynaptic spike triggered onset epsp peak epsp amplitude increased case decreased case respectively similar experimental observations critical 
window synaptic modifications model depends parameter shape backpropagating action potential window plasticity examined varying time interval between presynaptic stimulation postsynaptic spiking shown figure synaptic efficacy exhibited highly asymmetric dependence spike timing similar physiological data potentiation observed epsps occurred postsynaptic spike maximal potentiation depression observed epsps occurring peak postsynaptic spike depression gradually decreased approaching delays greater neocortical neurons tectal neurons hippocampal neurons narrow transition zone roughly model separated potentiation depression windows figure synaptic plasticity model neocortical neuron left panel epsp model neuron evoked presynaptic spike excitatory synapse pairing spike postsynaptic spiking delay pairing induces longterm potentiation panel presynaptic stimulation occurs postsynaptic firing synapse weakened resulting decrease peak epsp amplitude critical window synaptic plasticity obtained varying delay postsynaptic spiking negative refer presynaptic postsynaptic spiking results learning sequences temporally asymmetric hebbian plasticity network model neurons learn sequences learning mechanism simplest case excitatory neurons connected receiving inputs separate input neurons figure suppose input neuron fires input neuron causing neuron fire figure spike results subthreshold epsp synapse input arrives time epsp temporal summation epsps fire synapse strengthened synapse hand weakened epsp arrives milliseconds fired subsequent trial input neuron fire turn fire milliseconds input occurs potentiation recurrent synapse previous trials figure input neuron predictive feedback occurrence input activity marked figure inhibition prevents input exciting similarly positive feedback loop neurons avoided synapse weakened previous trials arrows figures figure depicts process potentiation depression synapses function number input sequence decrease latency predictive spike 
elicited respect timing input shown notice learning spike occurs occurrence input learning occurs input emergence direction selectivity simulations network connected excitatory neurons shown figure receiving retinotopic sensory input consisting moving pulses excitation pulse excitation neuron rightward leftward directions task network predict sensory input learning recurrent connections neuron network starts firing milliseconds arrival input pulse excitation network comprised chains neurons mutual inhibition dark arrows pairs neurons chains network initialized chain figure learning predict temporally asymmetric hebbian learning network model neurons connected excitatory synapses input neurons inhibit input neurons inhibitory interneurons circles network activity elicited sequence network activity sequence trials learning recurrent synapse recurrent excitation fire expected arrival input dashed line allowing inhibit synapse weakened preventing downward arrows show decrease epsp potentiation depression synapses learning synaptic strength defined maximal synaptic conductance kinetic model synaptic transmission latency predictive spike learning measured respect time input spike dotted line excitatory neuron received excitation inhibition figure excitatory inhibitory synaptic currents calculated kinetic models synaptic transmission based properties receptors determined recordings maximum conductances synapses initialized small positive values dotted lines figure slight asymmetry recurrent excitatory connections breaking symmetry chains network exposed alternately leftward rightward moving stimuli total trials excitatory connections labeled figure modified according asymmetric hebbian learning rule figure excitatory connections inhibitory interneuron labeled modified asymmetric antihebbian 
learning rule reversed polarity rule figure synaptic conductances learned neurons marked figure located corresponding positions chains trials exposure moving stimuli shown figure solid line initially rightward motion slight asymmetry figure direction selectivity model model network consisting chains connected neurons receiving retinotopic inputs neuron receives recurrent recurrent inhibition arrows inhibition arrows counterpart chain recurrent connections neuron labeled arise preceding neurons chain inhibition neuron mediated interneuron circle synaptic strength recurrent excitatory connections neurons dotted lines learning solid lines synapses adapted trials exposure alternating leftward rightward moving stimuli responses neurons rightward leftward moving stimuli result learning neuron selective rightward motion neurons chain neuron selective leftward motion preferred direction neuron starts firing milliseconds actual input arrives soma marked recurrent excitation preceding neurons dark triangle represents start input stimulation network initial excitatory connections neuron fire slightly earlier neuron neuron additionally epsps neurons lying left occur fires excitatory synapses neurons strength excitatory synapses neurons inhibitory interneuron weakened learning rules mentioned hand synapses neurons lying side weakened inhibitory connections strengthened epsps connections occur fired synapses neuron interneuron remain postsynaptic firing inhibition backpropagating action potentials dendrite shown figure trials excitatory inhibitory connections neuron exhibit marked asymmetry excitation neurons left inhibition neurons neuron exhibits opposite pattern connectivity expected neuron found selective rightward motion neuron selective leftward motion figure stimulus motion preferred 
direction neuron starts firing milliseconds time arrival input stimulus soma marked recurrent excitation preceding neurons conversely motion preferred direction triggers recurrent inhibition preceding neurons inhibition figure comparison monkey model spacetime response plots left sequence obtained optimally oriented bars positions receptive field complex cell alert monkey cells preferred direction part represented bottom flash duration delay stimulus presentations obtained model neuron stimulating chain neurons positions left right side neuron lower represent stimulations preferred side upper represent stimulations null side active neuron position chain learned pattern connectivity direction selective neurons comprising chains network code predict moving input stimulus direction average firing rate neurons network preferred direction range cortical firing rates moving stimuli assuming separation excitatory model neurons chain utilizing values cortical magnification factor monkey striate cortex estimate preferred stimulus velocity model neurons fovea periphery values fall range monkey striate cortical velocity preferences model predicts connections direction selective neuron exhibit pattern asymmetrical excitation inhibition similar figure recent study direction selective cells awake monkey found excitation side receptive field inhibition null side consistent pattern connections learned model comparison experimental data background activity model generated incorporating random excitatory inhibitory alpha synapses dendrite model neuron post stimulus time histograms spacetime response plots obtained optimally oriented stimuli random positions cells activating region shown figure good qualitative agreement response plot complex cell model spacetime plots show response onset time increase response preferred direction model recurrent excitation progressively closer cells preferred side 
firing reduced background rates stimulus onset part plots model recurrent inhibition cells null side response response time appears spacetime maps related neurons velocity sensitivity conclusions results show network connected neurons temporal difference based asymmetric hebbian learning mechanism learn predictive model spatiotemporal inputs exposed moving stimuli neurons simulated work learned fire milliseconds expected arrival input stimulus developed direction selectivity consequence learning model predicts direction selective neuron start responding milliseconds preferred stimulus enters retinal input dendritic field predictive neural activity recently reported retinal ganglion cells temporally asymmetric hebbian learning previously suggested mechanism sequence learning explanation asymmetric expansion hippocampal place fields route learning theories require long temporal synaptic plasticity order hundreds milliseconds utilized temporal windows range coincidence detection sequence learning model based window plasticity range roughly consistent recent physiological observations idea prediction sequence learning constitute important goal neocortex previously suggested context statistical information theoretic models cortical processing biophysical simulations suggest implementation models cortical circuitry problem encoding generating temporal sequences sensory motor domains hypothesis predictive sequence learning recurrent neocortical circuits provide unifying principle studying cortical structure function references douglas science neurosci neurophysiol zipser neural comput chance nature neuroscience markram science levy neuroscience proc natl acad zhang nature neurosci gerstner nature kempter neural info proc systems kearns solla cohn press cambridge abbott blum gerstner abbott comp neurosci levy proceedings world congress neural networks montague sejnowski learning memory montague nature science ballard neural computation ballard nature neuroscience 
barlow perception sutton machine learning sutton barto learning computational neuroscience foundations adaptive networks moore press cambridge mainen sejnowski nature methods neuronal modeling koch segev editors press cambridge berry nature neuron proc natl acad abbott song neural info proc systems kearns solla cohn press cambridge
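The plasticity rule described in the paper above changes the maximal synaptic conductance by an amount proportional to the temporal difference in postsynaptic membrane potential, V(t + delta) - V(t), where t is the time of presynaptic activation and delta is a delay parameter. The following is a minimal sketch of that idea only. It assumes a toy Gaussian bump in place of the backpropagating action potential produced by the compartmental model, and the names `membrane_potential`, `td_weight_change`, `width_ms`, `delta_ms` and `rate` are hypothetical, not taken from the paper.

```python
import math


def membrane_potential(t_ms, t_post_ms=0.0, width_ms=5.0):
    """Toy dendritic potential: a Gaussian bump centred on the postsynaptic
    spike time. This shape is an assumption standing in for the simulated
    backpropagating action potential."""
    return math.exp(-((t_ms - t_post_ms) ** 2) / (2.0 * width_ms ** 2))


def td_weight_change(t_pre_ms, delta_ms=5.0, rate=1.0):
    """Weight change proportional to V(t_pre + delta) - V(t_pre)."""
    later = membrane_potential(t_pre_ms + delta_ms)
    now = membrane_potential(t_pre_ms)
    return rate * (later - now)


# Sweep the presynaptic time relative to a postsynaptic spike at 0 ms.
# Negative times mean the presynaptic input precedes the postsynaptic spike.
window = {t_pre: td_weight_change(t_pre) for t_pre in range(-40, 41, 5)}

for t_pre in sorted(window):
    print(f"t_pre = {t_pre:+4d} ms   weight change = {window[t_pre]:+.4f}")
```

With these assumed parameters the sweep gives a positive change when the input arrives well before the spike (the potential is still rising) and a negative change when it arrives after the peak, which is the qualitative asymmetry the text reports. The exact width and crossover point of the window depend on the assumed waveform and on `delta_ms`, so the numbers should not be read as the paper's results.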
3 note learning rate schedules stochastic optimization darken john moody yale university yale station haven email abstract present compare learning rate schedules stochastic gradient descent general algorithm includes online backpropagation kmeans clustering special cases introduce converge type schedules outperform classical constant running average schedules speed convergence quality solution introduction stochastic gradient descent optimization task find parameter vector minimizes function context learning systems typically average objective function exemplars labeled stochastic gradient descent algorithm time recent random exemplar comparison deterministic gradient descent algorithm figure comparison shapes schedules dashed line constant solid line dotted average stochastic step equal deterministic step exemplar stochastic step direction stochastic algorithm preferable exemplar large making average exemplars expensive compute issue addressed paper function choose learning rate schedule order obtain fast convergence good local minimum schedules compared paper constant running average class schedules paper specific equation member class chosen comparison simplest member class find schedules typically outperform classical constant running average schedules schedules capable optimal asymptotic convergence rate objective function exemplar distribution classical schedules adaptive schedules scope short paper darken moody nonetheless adaptive schedules literature aware order expensive compute large numbers parameters make claim asymptotic optimality darken moody task kmeans clustering sample gradientdescent task choose kmeans clustering problem clustering good sample problem study inherent usefulness illustrative qualities clustering important technique signal compression communications engineering machine learning field clustering frontend function learning speech recognition systems clustering features illustrative 
stochastic optimization problem adaptive simple local minima small problems significantly means live dimensional space visualization parameter vector simple interpretation lowdimensional points easily plotted understood kmeans task locate points called means minimize distance random exemplar nearest exemplar function minimized nearest exemplar equivalent form density exemplar distribution indicator function region correspond stochastic gradient descent algorithm function nearest exemplar moves directly exemplar fractional distance slight generalization stochastic descent algorithm total number exemplars including current assigned specific problem compare schedules means uniformly distributed unit square simple problem observed local minima global minimum means located centers uniform grid square simulation results presented figures constant schedule constant learning rate traditional choice backpropagation constant rate generally parameter vector means case clustering converge parameters minimum average distance proportional variance depends objective function exemplar statistics exemplars generally assumed unknown residual predicted resulting degradation measures system performance squared classification error instance difficult predict study make parameters converge significant practical interest current practice backpropagation large restart learning smaller shrinking result residual time speed convergence drops clustering problem phenomenon appears local parameter vector poor solution long time slowly running average schedule running average schedule stochastic approximation literature kmeans clustering schedule optimal performs poorly moderate large problem means clear decrease slowly order good solution reached advantage schedule parameter vector proven converge local minimum class schedules guaranteed converge converges quickly stochastic approximation theory stochastic approximation literature grown began 
paper find conditions learning rate ensure convergence optimal speed ljung find asymptotically sufficient guarantee convergence power schedules work practice darken moody goldstein find order converge optimal rate asymptotically threshold depends objective function exemplars optimal convergence rate achieved running average schedule asymptotically convergence rate running average schedule improved resulting instability small improvements asymptotic convergence rate schedules introduce class schedules guaranteed converge achieve optimal convergence rate stability problems schedules characterized features learning rate stays high search time hoped parameters find good minimum times greater learning rate decreases parameters converge cited theory generally directly apply full nonlinear setting interest practical work details relation theory practical applications complete quantitative theory asymptotic darken moody choice asymptotic conditions white darken moody choose simplest class schedules study shortterm linear schedule called learning rate decreases linearly search phase schedule reduces running average schedule conclusions introduced class learning rate schedules stochastic approximation theory large schedules achieve optimally fast asymptotic convergence exemplar distribution objective function constant running average schedules achieve empirical measurements kmeans clustering tasks expectation asymptotic conditions obtain surprisingly quickly additionally schedule improves observed likelihood local minima implied kmeans clustering stochastic gradient descent algorithm online backpropagation great interest learning systems community space limitations experiments settings published darken moody preliminary experiments confirm generality conclusions extensions work progress includes application algorithms simple gradient descent adaptive algorithms automatically determine search time acknowledgements authors white conversations kauffman developing produce figure work 
supported grant afosr grant references darken moody fast adaptive kmeans clustering empirical results international joint conference neural networks ieee neural networks council darken moody learning rate schedules stochastic optimization preparation goldstein square optimality continuous time technical report department mathematics university southern california ljung analysis recursive stochastic algorithms ieee trans automatic control methods classification analysis multivariate proc berkeley symp math stat prob stochastic approximation method math stat white learning artificial neural networks statistical perspective neural computation figure runs classical schedules clustering task exemplars uniformly distributed square dots previous locations means triangles visible final locations means running average schedule exemplars means minimum slowly large constant schedule exemplars means global minimum large average distance small constant schedule exemplars means stuck metastable local minimum small constant exemplars means local minimum global minimum figure comparison runs schedules cluster task exemplars schedule defined small constant schedule note welldefined transitions metastable local minima large late runs running average schedule runs stick local minimum slowly head global minimum schedule head global minimum suboptimal rate asymptotic slope schedule runs head global minimum optimally quick rate asymptotic slope
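The three kinds of learning rate schedule compared in the note above (constant, running average, and the new class that holds the rate high during a search time and then lets it decay) can be sketched alongside the online k-means update, in which the mean nearest to each exemplar moves a fractional distance toward it. The closed forms did not survive in this text, so the ones below are assumptions: `eta0 / t` for the running average schedule and `eta0 / (1 + t / tau)` for the search-then-converge form. The latter is roughly linear in t for t much smaller than `tau` and behaves like `eta0 * tau / t` for t much larger than `tau`, which matches the description of a rate that decreases linearly during the search phase and then follows a running average. All function names are hypothetical.

```python
import random


def constant_schedule(eta0):
    """Learning rate that never changes."""
    return lambda t: eta0


def running_average_schedule(eta0=1.0):
    """Assumed form eta0 / t, with t counted from 1."""
    return lambda t: eta0 / t


def search_then_converge_schedule(eta0, tau):
    """Assumed form eta0 / (1 + t / tau): near eta0 while t << tau,
    close to eta0 * tau / t once t >> tau."""
    return lambda t: eta0 / (1.0 + t / tau)


def online_kmeans(exemplars, initial_means, schedule):
    """Stochastic k-means: move the nearest mean toward each exemplar
    by the fraction given by schedule(t)."""
    means = [list(m) for m in initial_means]
    for t, x in enumerate(exemplars, start=1):
        nearest = min(
            range(len(means)),
            key=lambda i: sum((a - b) ** 2 for a, b in zip(means[i], x)),
        )
        eta = schedule(t)
        means[nearest] = [m + eta * (a - m) for m, a in zip(means[nearest], x)]
    return means


if __name__ == "__main__":
    rng = random.Random(0)
    data = [(rng.random(), rng.random()) for _ in range(5000)]
    start = [(rng.random(), rng.random()) for _ in range(4)]
    for name, sched in [
        ("constant", constant_schedule(0.1)),
        ("running average", running_average_schedule(1.0)),
        ("search-then-converge", search_then_converge_schedule(0.1, 500.0)),
    ]:
        print(name, online_kmeans(data, start, sched))
```

Passing the schedule as a function of the global step keeps the update loop identical across the three cases, so only the rate differs between runs. The per-cluster variant mentioned in the text, where the rate depends on the number of exemplars assigned to a given mean, would need a counter per mean and is not shown.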
12 learning rules stabilization persistent neural activity sebastian seung dept brain cambridge abstract analyze conditions synaptic learning rules based action potential timing approximated learning rules based firing rates form plasticity synapses presynaptic spike postsynaptic spike opposite temporal ordering differential approximated conditions learning rule depends time derivative postsynaptic firing rate learning rule acts stabilize persistent neural activity patterns recurrent neural networks introduction recent experiments demonstrated types synaptic plasticity depend temporal ordering postsynaptic spiking cortical synapses longterm potentiation induced repeated pairing presynaptic spike postsynaptic spike longterm depression results order reversed dependence change synaptic strength difference postsynaptic presynaptic spike times measured quantitatively pairing function sketched figure positive negative width tens milliseconds pairing function differential post time figure pairing function differential learning change synaptic strength plot versus time difference postsynaptic presynaptic spikes pairing function antihebbian learning differential hebbian learning driven firing rates synaptic learning rule applied poisson spike trains synaptic strength remains roughly constant time rate correspond potentiation depression refer synaptic plasticity hebbian conditions potentiation predicted differential driven difference processes potentiation depression pairing function figure characteristic synapses opposite temporal dependence observed electrosensory lobe synapses electric fish shown figure synapses presynaptic spike postsynaptic order reversed refer differential antihebbian plasticity experiments maximum ranges differential hebbian hebbian pairing functions roughly fairly short compatible descriptions neural activity based spike timing instantaneous firing fact show conditions learning rules approximated ratebased learning rules people studied relationship ratebased learning pairing 
functions figures lead ratebased learning rules traditionally neural networks depend temporal derivatives firing rates firing rates argue differential hebbian learning rule figure general mechanism tuning strength positive feedback networks maintain shortterm memory analog variable persistent neural activity number recurrent network models proposed explain neural activity motor cortical areas head direction system oculomotor models require precise tuning synaptic strengths order maintain continuously variable levels persistent activity simple illustration tuning differential hebbian learning model persistent activity maintained integrateandfire neuron excitatory studied learning rule pairing functions figure measured repeated pairing single presynaptic spike single postsynaptic spike quantitative measurements synaptic complex patterns spiking activity assume simple model synaptic change arbitrary spike trains contributions pairings presynaptic postsynaptic spikes model exact description real synapses turn approximately valid write spike train neuron series delta functions spike time neuron synaptic weight neuron time denoted change synaptic weight induced presynaptic spikes occurring time interval modeled presynaptic spike paired postsynaptic spikes produced pairing synaptic weight changed amount depending pairing function pairing function assumed nonzero inside interval refer pairing range model presynaptic spike results induction plasticity latency arguments left hand side equation shifted relative limits integral hand side assume latency greater pairing range time influenced presynaptic postsynaptic spikes time learning rule causal relation ratebased learning rules learning rule driven correlations presynaptic postsynaptic activities dependence made explicit making change variables yields defined crosscorrelation made fact vanishes interval goal relate learning rules based crosscorrelation firing rates number 
ways defining instantaneous firing rates computed averaging repeated presentations stimulus situations defined temporal filtering spike trains discussion general apply definitions firing rates rate correlation commonly subtracted total correlation obtain spike correlation derive ratebased approximation learning rule rewrite spike simply neglect term shortly discuss conditions good approximation derive form term applying approximation obtain define approximation good firing rates vary slowly compared pairing range learning rule depends postsynaptic rate term dominates learning rule conventional based correlations firing rates sign determines rule hebbian antihebbian remainder paper discuss case holds pairing functions shown figures positive negative areas cancel definition dependence postsynaptic activity purely time derivative firing rate differential hebbian learning corresponds figure differential antihebbian learning leads figure summarize case synaptic rate correlations approximated hebbian antihebbian slowly varying rates formulas imply constant postsynaptic firing rate change synaptic strength rate required induce synaptic plasticity illustrate point figure shows result applying differential antihebbian learning spike trains presynaptic spike train generated poisson process postsynaptic spike train generated poisson process rate shifted shift synaptic strength remains roughly constant upward shift firing rate downward shift synaptic strength accord sign differential antihebbian rule ratebased approximation works term important return issue general conditions term neglected poisson spike trains spike correlations limit finite integral term fluctuations amount depends pairing range sets limits integration figure long pairing range made fluctuations small small hand short fluctuations small large averaging large relevant amplitude small rate learning slow case takes long time significant synaptic accumulate plasticity effectively driven integrating long time 
periods brain spike correlations observed limit unlike poisson spike trains correlations roughly symmetric case produce plasticity pairing functions figures hand spike correlations asymmetric lead substantial effects recurrent network dynamics learning rules depend presynaptic postsynaptic rates learning rules neural networks special feature depend time derivatives computational consequences recurrent neural networks form classical neural network equations derived realistic models method averaging field approximation firing rate neuron identified cost function quantifies amount drift firing rate point state space network function defined gradient cost function respect assuming monotonically increasing function differential hebbian update increases cost function increases magnitude drift velocity contrast differential antihebbian update decreases drift velocity suggests differential antihebbian update creating fixed points network dynamics persistent activity spiking model preceding arguments drift velocity based approximate ratebased learning network dynamics important implement learning spiking network dynamics check approximations valid numerically simulated simple recurrent circuit integrateandfire neurons shown core circuit memory neuron makes receives synaptic input input neurons tonic neuron excitatory burst inhibitory burst neuron circuit store short term memory analog variable persistent activity strengths tonic synapse precisely show accomplished spike based learning rule antihebbian pairing function figure memory neuron equations tonic excitatory burst memory inhibitory burst figure circuit diagram model membrane potential reaches spike considered occurred reset spike time jump synaptic activation size decays exponentially time constant spike synaptic conductances memory neuron term recurrent excitation strength synaptic activations tonic excitatory burst inhibitory burst neurons governed equations differences 
neurons synaptic input firing patterns determined applied tonic neuron constant applied current makes fire roughly figure excitatory inhibitory burst neurons applied current current pulses bursts action potentials shown figure synaptic strengths arbitrarily learning burst neurons transient firing rate memory neuron applying learning rule tune memory tuned figure tuned activity middle traces membrane potentials input neurons figure spikes drawn reset times integrateandfire neurons learning activity memory neuron persistent shown trace learning rule applied synaptic weights burst inputs persistent activity neuron maintain persistent activity intervals burst made synaptic differential hebbian pairing function spike time differences range resulting increase persistence time figure values synaptic weights versus time quantify performance system maintaining persistent activity determined relationship long sequence intervals defined reciprocal interspike interval fixed optimally tuned values residual drift shown figure parameters allowed adapt continuously good tuning achieved residual drift smaller magnitude learning rule synaptic weights interval reducing drift firing rate learning driven autocorrelation spike train cross correlation peak effect pairing function origin autocorrelation small time lags fairly large pairing range simulations recurrent network neurons shorter pairing range suffice crosscorrelation vanish discussion shown differential antihebbian learning tune recurrent circuit maintain persistent neural activity behavior understood reducing learning rule ratebased learning rules ratebased approximation good conditions satisfied pairing range large rate learning slow spike synchrony weak effect learning shape pairing function differential antihebbian pairing function results learning rule negative feedback signal reduce amount drift firing rate illustrated simulations integrateandfire neuron excitatory generally learning rule relevant tuning strength 
positive feedback network maintain shortterm memory analog variable persistent neural activity figure tuning persistence time activity increases weight tuned transition driven bursts input systematic relationship drift firing rate measured long sequence intervals weights continuously drift fixed weights learning rule improve robustness oculomotor head direction parameters differential forms learning rules assumed areas positive negative pairing function equal integral defining vanishes reality cancellation exact ratio limit persistence time achieved learning rule oculomotor integrator head direction system integrate vestibular inputs produce activity patterns problem finding general present learning rules train networks integrate open references markram science hebb organization behavior wiley york bell grant nature gerstner kempter hemmen wagner nature abbott song neural info proc syst comput neurosci kempter gerstner hemmen phys georgopoulos science wang comput neurosci zhang neurosci robinson biol cybern seung proc natl acad seung tank neuron neural comput sompolinsky neurosci abstr seung tank comput neurosci part theory
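The spike-based model in the paper above treats the total synaptic change as a sum of contributions from every pairing of a presynaptic and a postsynaptic spike, with each contribution given by a pairing function of the spike time difference that is nonzero only inside the pairing range. The sketch below illustrates that pairwise sum under stated assumptions. The sine shape, the 20 ms pairing range, and the names `anti_hebbian_pairing`, `weight_change`, `amplitude` and `tau_ms` are hypothetical choices for illustration, and the induction latency discussed in the text is ignored.

```python
import math


def anti_hebbian_pairing(u_ms, amplitude=1.0, tau_ms=20.0):
    """Assumed differential anti-Hebbian pairing function of
    u = t_post - t_pre: depression when the postsynaptic spike follows the
    presynaptic spike (u > 0), potentiation for the opposite order, and
    zero outside the pairing range. Positive and negative areas cancel."""
    if abs(u_ms) >= tau_ms:
        return 0.0
    return -amplitude * math.sin(math.pi * u_ms / tau_ms)


def weight_change(pre_spikes_ms, post_spikes_ms, pairing=anti_hebbian_pairing):
    """Sum the pairing function over all pre/post spike pairs."""
    return sum(
        pairing(t_post - t_pre)
        for t_pre in pre_spikes_ms
        for t_post in post_spikes_ms
    )


# Pre-then-post ordering depresses the synapse, post-then-pre potentiates it.
print(weight_change([0.0], [5.0]))
print(weight_change([5.0], [0.0]))
# Postsynaptic spikes placed symmetrically around a presynaptic spike cancel,
# a small-scale analogue of a constant postsynaptic rate producing no net change.
print(weight_change([0.0], [-5.0, 5.0]))
```

Because the assumed pairing function is odd, any spike configuration that is symmetric in time around the presynaptic spike contributes nothing, and a net change needs more postsynaptic spikes on one side than the other. That is the intuition behind the statement that this rule responds to the time derivative of the postsynaptic firing rate and not to the rate itself.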
12 relevance vector machine michael tipping microsoft research george house street cambridge abstract support vector machine stateoftheart technique regression classification combining excellent generalisation properties sparse kernel representation suffer number disadvantages notably absence probabilistic outputs requirement estimate tradeoff parameter mercer kernel functions paper introduce relevance vector machine bayesian treatment generalised linear model identical functional form suffers none disadvantages examples demonstrate comparable generalisation performance requires dramatically fewer kernel functions introduction supervised learning examples input vectors targets real values regression class labels classification training learn model dependency targets inputs objective making accurate predictions previously unseen values realworld data presence noise regression class overlap classification implies principal modelling challenge avoid overfitting training successful approach supervised learning support vector machine makes predictions based function form model weights kernel function feature classification case target function attempts minimise number errors made training simultaneously maximising margin classes feature space implicitly defined kernel effective prior avoiding overfitting leads good generalisation results sparse model dependent subset kernel functions training examples margin wrong side stateoftheart results reported tasks svms applied support vector methodology exhibit significant disadvantages predictions not probabilistic regression outputs point estimate classification hard binary decision ideally desire estimate conditional distribution order capture uncertainty prediction regression form crucial classification posterior probabilities class membership adapt varying class priors asymmetric misclassification costs sparse svms make kernel functions number grows size training estimate tradeoff parameter regression parameter 
generally entails crossvalidation procedure data computation kernel function satisfy condition paper introduce relevance vector machine probabilistic sparse kernel model identical functional form adopt bayesian approach learning introduce prior weights governed hyperparameters weight probable values iteratively estimated data sparsity achieved practice find posterior distributions weights sharply peaked unlike support vector classifier weights examples close decision boundary represent prototypical examples classes term examples relevance vectors principle automatic relevance determination motivates presented approach feature capable generalisation formance comparable equivalent typically dramatically fewer kernel functions suffers limitations outlined section introduce bayesian model initially regression define procedure obtaining hyperparameter values weights section give examples application regression case developing theory classification case section examples classification section concluding discussion relevance vector regression dataset pairs follow standard formula tion assume gaussian distribution modelled defined likelihood dataset written design matrix maximumlikelihood estimation generally lead severe overfitting encode preference smoother functions defining gaussian prior weights tipping vector hyperparameters introduction individual weight feature model ultimately responsible sparsity properties posterior weights tained bayes rule defined treated hyperparameter estimated data integrating weights obtain marginal likelihood evidence hyperparameters ideal bayesian inference define integrate hyperparameters performed closedform adopt procedure based mackay optimise marginal likelihood respect essentially type maximum likelihood method equivalent finding maximum assuming uniform make predictions based maximising values note hyperparameters values maximise obtained closed form alternative formulae iterative reestimation weights hidden variables approach 
direct differentiation defined quantities interpreted measure parameter data generally update observed exhibit faster convergence noise variance methods lead reestimate practice reestimation find approach infinity infinitely peaked implying kernel functions pruned space detailed explanation occurs occam penalty paid smaller values appearance determinant marginal likelihood lesser penalty paid explaining data increased noise case relevance vector machine examples relevance vector regression synthetic function function commonly illustrate support vector regression place classification margin region introduced tube function errors case support vectors edge region linear spline kernels approximation based noisefree samples support vectors comparison approximate function relevance vector model kernel case noise variance approximating function plotted figure left requires relevance vectors largest error compared case figure illustrates case gaussian noise standard deviation added targets approximation relevance vectors noise automatically estimated figure relevance vector approximation noisefree data left added gaussian noise estimated functions drawn solid lines relevance vectors shown case true function shown dashed benchmarks table illustrates regression performance popular benchmark datasets synthetic functions results averaged generated training sets size test boston housing dataset averaged splits prediction error obtained number kernel functions required support vector regression relevance vector regression errors kernels dataset friedman friedman friedman boston housing tipping relevance vector classification extend relevance vector approach case classification desired predict posterior probability class membership input linear model applying logistic sigmoid function writing likelihood integrate weights obtain marginal likelihood analytically iterative procedure based mackay current fixed values find probable weights location posterior mode equivalent standard 
opti logistic model efficient iteratively leastsquares algorithm find maximum compute hessian inverted give covariance gaussian approximation posterior weights hyperparameters updated note noise variance procedure repeated suitable convergence criteria satisfied note bayesian treatment multilayer neural networks gaussian approximation considered weakness method posterior mode probability mass note hessian considerably confidence gaussian approximation examples classification synthetic gaussian mixture data artificially generated data dimensions order illustrate graphically selection relevance vectors class denoted sampled single gaussian overlaps small degree class sampled mixture gaussians relevance vector classifier compared support vector counterpart gaussian kernel selected cross validation training results typical dataset examples figure test errors comparable remarkable feature contrast complexity classifiers support vector machine kernel functions compared relevance vector method notable relevance vectors distance decision boundary analysis observation consistent hyperparameter update equations qualitative tion output basis function lying decision boundary poor indicator class membership basis functions naturally bayesian framework relevance vector machine figure results training functionally identical left clas typical synthetic dataset decision boundary shown dashed vectors shown dramatic reduction complexity model real examples table give error complexity results pima diabetes usps handwritten digit datasets task recently illustrate bayesian classification related gaussian process technique authors split data training test examples result case dataset popular support vector benchmark comprising training examples test result errors kernels dataset pima usps terms prediction accuracy superior pima outperformed digit data consistent examples paper classifiers fewer kernel functions achieves stateoftheart performance diabetes dataset kernels noted reduced 
methods exist subsequently pruning support vector models reduce required number kernels expense increase error results usps data discussion examples paper effectively demonstrated relevance vector machine attain comparable regression apparently superior level generalisation accuracy support vector approach time dramatically fewer kernel functions implying considerable tipping saving memory computation practical implementation importantly benefit absence additional nuisance parameters choose type kernel parameters fact case kernel parameters obtained improved terms accuracy sparsity results benchmarks section marginal likelihood respect multiple input scale parameters gaussian kernels exploit bayesian formalism guide choice kernel noted presented methodology applicable arbitrary basis functions limited mercer kernels advantage classifier standard formulation probabilistic generalised linear model implies extended case principled manner train heuristically combine multiple classifiers standard practice estimation posterior probabilities class membership major benefit convey principled measure uncertainty prediction essential adaptation varying class priors incorporation asymmetric misclassification costs noted principal disadvantage relevance vector methods complexity training phase repeatedly invert hessian matrix requiring storage computation large datasets makes training considerably slower memory constraints limit training examples developed approximation methods handling larger datasets employed usps handwritten digit note case bayesian methods generally strongest data sparseness resulting classifier induced bayesian framework presented motivation apply relevance vector techniques larger datasets acknowledgements author wishes chris bishop john platt bernhard schölkopf helpful discussions sequential minimal optimisation code references berger statistical decision theory bayesian analysis springer york edition mackay bayesian interpolation neural computation mackay evidence
framework applied classification networks neural computation mackay bayesian nonlinear modelling prediction competition transactions pages efficient training networks classification proceedings pages london neal bayesian learning neural networks springer york schölkopf burges smola input space versus feature space methods ieee transactions neural networks vapnik statistical learning theory wiley york williams barber bayesian classification gaussian processes ieee trans pattern analysis machine intelligence
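The iterative re-estimation described in the relevance vector regression section (posterior over weights for fixed hyperparameters, then an update of each hyperparameter and of the noise variance, with basis functions pruned once their hyperparameter grows very large) can be sketched roughly as follows. This is a minimal illustration and not the author's code: the function names, the Gaussian kernel and its width, the pruning threshold `alpha_max`, the iteration count and the numerical floors are all assumptions made for the example, and no bias term is included.

```python
import numpy as np


def gaussian_design(x, centers, width):
    """Design matrix with one Gaussian basis function per centre (assumed kernel)."""
    diff = x[:, None] - centers[None, :]
    return np.exp(-(diff / width) ** 2)


def fit_rvm_regression(Phi, t, n_iter=300, alpha_max=1e9):
    """Sketch of evidence-based re-estimation for sparse Bayesian regression.

    Returns the posterior mean weights (zeros for pruned basis functions),
    the estimated noise variance and the indices of the surviving basis
    functions (the "relevance vectors").
    """
    n, m = Phi.shape
    alpha = np.ones(m)                    # one hyperparameter per weight
    sigma2 = max(0.1 * np.var(t), 1e-8)   # initial noise variance guess (assumption)
    active = np.arange(m)                 # basis functions not yet pruned
    for _ in range(n_iter):
        P = Phi[:, active]
        # Gaussian posterior over the active weights for fixed alpha and sigma2.
        Sigma = np.linalg.inv(np.diag(alpha[active]) + P.T @ P / sigma2)
        mu = Sigma @ (P.T @ t) / sigma2
        # gamma_i in [0, 1] measures how well weight i is determined by the data.
        gamma = np.clip(1.0 - alpha[active] * np.diag(Sigma), 1e-12, 1.0)
        alpha[active] = gamma / np.maximum(mu ** 2, 1e-30)
        resid = t - P @ mu
        sigma2 = max(resid @ resid / max(n - gamma.sum(), 1e-6), 1e-8)
        # A very large alpha pins the weight at zero, so drop that basis function.
        active = active[alpha[active] < alpha_max]
    # Recompute the posterior mean for the surviving basis functions.
    P = Phi[:, active]
    Sigma = np.linalg.inv(np.diag(alpha[active]) + P.T @ P / sigma2)
    w = np.zeros(m)
    w[active] = Sigma @ (P.T @ t) / sigma2
    return w, sigma2, active


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = np.linspace(-10, 10, 100)
    truth = np.sinc(x / np.pi)            # sin(x) / x
    t = truth + 0.1 * rng.standard_normal(x.size)
    Phi = gaussian_design(x, x, width=2.0)
    w, sigma2, relevant = fit_rvm_regression(Phi, t)
    print("relevance vectors:", len(relevant), "noise std estimate:", sigma2 ** 0.5)
```

Predictions at new inputs would use `gaussian_design(x_new, x, width) @ w`; only the columns listed in `relevant` contribute, which is where the sparsity comes from.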
11 compared ratebased hebbian learning richard kempter institut technische germany gerstner swiss institute technology center systems switzerland hemmen institut für technische germany abstract correlationbased learning rule spike level formulated mathematically analyzed compared learning description differential equation learning dynamics derived assumption time scales learning spiking separated linear neuron model receives timedependent stochastic input show spike correlations time scale play role correlations input output spikes tend stabilize structure formation provided form learning window accordance principle conditions intrinsic average synaptic weight discussed introduction learning rules formulated terms firing rates continuous variable reflecting activity neuron hebbian hebb learning rule driven correlations presynaptic postsynaptic rates generate neuronal receptive fields linsker mackay miller properties similar real neurons ratebased description effects pulse structure neuronal signals recent years experimental author kempter gerstner hemmen theoretical evidence accumulated suggests temporal spikes scale play important role neuronal information processing bialek cart abeles gerstner synaptic efficacy depend precise timing postsynaptic action potentials presynaptic input spikes markram zhang synaptic weight found increase presynaptic firing postsynaptic spike decreased contrast standard rate models hebbian learning learning rule discussed paper takes effects account mathematical details numerical simulations reader referred kempter derivation learning equation specification hebb rule neuron receives input synapses efficacies assume induced postsynaptic spikes learning rule consists parts time input spike arriving synapse arrival spike induces weight change amount positive negative output spike neuron consideration event triggers change efficacies amount positive negative finally time differences input spikes influence change efficacies time difference
input output spikes changed amount learning window real valued function learning window motivated local chemical processes level synapse gerstner simply assume learning window exist arbitrary functional dependence figure learning function delay postsynaptic firing time presynaptic spike arrival synapse note spike postsynaptic firing starting time efficacy total change time interval calculated summing contributions input output spikes time interval describing input spike train synapse series functions similarly output spikes formulate rules separation time scales total change subject noise stochastic spike arrival possibly stochastic generation output spikes study expected development weights denoted angular brackets make righthand side divide sides compared ratebased hebbian learning expectation interpret instantaneous firing rates vary short time scales shorter average interspike intervals model consistent idea temporal coding rely temporally averaged firing rates note integral time righthand side temporal averaging important larger typical interspike intervals define firing rates notation firing rates distinguished previously defined instantaneous rates defined expectation high temporal vary slowly time scale tion contrast firing rates order function time learning time larger width learning window integration extended introducing noticeable error definition temporally averaged correlation term reduces correlations postsynaptic spikes enter hebbian learning convolved learning window remark correlation change function fast time scale note definition implies presynaptic spike output spike expect excitatory synapses positive correlation input output usual theory hebbian learning require learning slow process correlation evaluated constant lefthand side rewritten differential slow time scale learning relation ratebased hebbian learning neural network theory hypothesis hebb hebb formulated learning rule change synaptic efficacy depends corre lation firing rate 
presynaptic firing rate postsynaptic neuron proportionality constants decay proportional product input term hebbian term rapidly changing instantaneous rates found auditory system auditory nerve carries noisy spike trains stochastic intensity modulated frequency applied acoustic tone barn significant modulation rates frequency cart kempter gerstner hemmen output rates synaptic driven separately postsynaptic rates parameters depend equation general formulation order rates linsker approximations correla tions input output spikes correlations contained rates approximate rates change slowly compared assumed learning time long compared width learning window simplify identify comparison identify reduce setting assumption derive hold general results markram width learning window cortical pyramidal cells range rate formulation requires activity slow time scale necessarily case existence oscillatory activity cortex range implies activity faster activity time scale found auditory system cart correlations activities additional correlations spikes exist reasons learning rule simple rate formulation insufficient study full learning equation stochastically spiking neurons poisson input stochastic neuron model proceed analysis determine correlations input spikes synapse output spikes correlations depend strongly neuron model consideration highlight main points learning study linear poisson neuron model input spike trains arriving synapses statistically independent poisson process timedependent intensities spike arriving synapse postsynaptic potential time assume excitatory epsp amplitude synaptic efficacy membrane potential neuron linear superposition contributions resting potential output spikes assumed generated stochastically time dependent rate depends linearly membrane potential linear function equality sign formally compared ratebased hebbian learning interpreted spontaneous firing rate excitatory synapses negative impossible equality sign sums spike arrival times synapses 
note spike generation process independent previous output spikes poisson model include refractoriness context interested expectation values input output expected input expected output expected output rate depends convolution input rates denote convolved rates expected correlations input output term inside square brackets spontaneous output rate term specific contribution input spike time output rate vanishes contributions synapses output spike time inserting assuming weights constant time interval obtain excitatory synapses term positive contribution correlation function recall means presynaptic spike postsynaptic firing figure interpretation term square brackets dotted line bution input spike time output rate function adding rate tion dashed line obtain rate inside square brackets full line time contribution input spike time learning equation assumption identical constant input rates reduces number free parameters eliminates effects rate coding introduce define find evolution slow time scale learning kempter gerstner hemmen discussion equation central result analysis describes expected dynamics synaptic weights hebbian learning rule assumption linear poisson neuron linsker derived mathematically equivalent equation starting linear graded response neuron ratebased model equation type analyzed mackay miller difference linskers equation slightly notation term interpretation interpretation correlations spikes time scales milliseconds enter driving term structure formation contrast linskers ansatz based firing rate description term correlations firing rates term firing rates place standard interpretation rate coding firing rate corresponds temporally averaged quantity averaging window hundred milliseconds formally define rates temporal averaging averaging window sense linskers rates made precise note asymmetric rates convolved relevance term important difference linskers ratebased learning rule existence argue causal chain events positive loss generality integral 
restricted response kernel vanishes excitatory synapses positive experiments excitatory synapses show positive markram zhang integral positive general argument based literal interpretation statement hebb recall means spike postsynaptic spiking excitatory synapses presynaptic spike postsynaptic firing postsynaptic tivity hebb puts contributed firing postsynaptic cell hebb rule predicts excitatory synapses positive claimed positive term rise exponential growth weights existing structure distribution weights enhanced contributes stability weight distributions strong synapses gerstner compared ratebased hebbian learning intrinsic normalization suppose input synapse special impose weak condition independent synapse index find average weight fixed point fixed point stable shown assumption enforce stability term sufficiently negative turn definition achieve integral sufficiently negative corresponds learning rule average antihebbian linear term sufficiently negative addition excitatory synapses reasonable fixed point positive stable fixed point turn implies sufficiently positive intrinsic normalization synaptic weights interesting property neurons stay optimal operating point synapses changing auditory neurons mechanism stay learning regime coincidence detection gerstner kempter cortical neurons principles operate regime high variability abbott nips talk volume conclusions learning simple ratebased learning rules spike based learning rule pick correlations input time scale mathematically main difference ratebased hebbian learning existence term accounts causal relation input output spikes correlations input output spikes time scale play role tend stabilize existing strong synapses references abeles domany editors models neural networks york springer bialek science cart neurosci gerstner nature gerstner maass bishop editors neural networks cambridge hebb organization behavior wiley york kempter neural comput kempter phys press linsker proc natl acad mackay miller network 
markram science preprint univ biol cybern zhang nature
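The three-part spike-based rule specified in the derivation above (a fixed change per input spike, a fixed change per output spike, and a learning-window contribution for every pair of input and output spikes) can be illustrated with a short sketch. The text deliberately leaves the window's functional form open, so the two-exponential window below, its amplitudes and time constants, the learning rate `eta` and all function names are hypothetical choices made only for this example.

```python
import math


def learning_window(s, a_plus=1.0, a_minus=-0.5, tau_plus=0.010, tau_minus=0.020):
    """Hypothetical learning window W(s), with s = t_pre - t_post in seconds.

    s < 0 means the presynaptic spike arrived before the postsynaptic spike.
    """
    if s < 0:
        return a_plus * math.exp(s / tau_plus)    # pre before post: increase
    return a_minus * math.exp(-s / tau_minus)     # pre after post: decrease


def weight_change(pre_spikes, post_spikes, eta=1e-3, w_in=0.0, w_out=0.0,
                  window=learning_window):
    """Total efficacy change over an interval, summed over all spike events."""
    total = w_in * len(pre_spikes) + w_out * len(post_spikes)
    for t_pre in pre_spikes:
        for t_post in post_spikes:
            total += window(t_pre - t_post)
    return eta * total


if __name__ == "__main__":
    # Input spike 5 ms before the output spike: the weight grows.
    print(weight_change([0.000], [0.005]))
    # Input spike 5 ms after the output spike: the weight shrinks.
    print(weight_change([0.005], [0.000]))
```

Averaging this quantity over many stochastic spike trains is what the expected-weight analysis in the text works with; the sketch only evaluates a single realisation.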
6 temporal difference learning position evaluation game schraudolph peter dayan terrence sejnowski computational neurobiology laboratory salk institute biological studies diego abstract game high branching factor tree search approach computer chess longrange interactions make position evaluation extremely difficult development conventional programs nature demonstrate viable alternative training networks evaluate positions poral difference learning approach based network architectures reflect spatial organization input reinforcement signals board training provide exposure unlabelled play techniques yield performance networks trained play network weights learned games position evaluation function enables primitive search commercial program playing level introduction developed popular board games world chess deterministic perfect information game strategy players alternate schraudolph dayan sejnowski placing black white intersections grid smaller objective surrounding board area opponent adjacent color form groups empty intersection adjacent group called group group captured removed board occupied opponent prevent loops make move prior board position player pass time game ends players pass succession unlike games remained skill computers acquire recognized challenge artificial intelligence rivest game tree search approach extensively computer chess infeasible game tree average branching factor ahead situations humans rely static evaluation board positions highly selective deep local lookahead conventional programs carefully tuned expert systems fundamentally limited human assistance integrating domain knowledge play level human machine learning approach offer considerable advantages shown optimization approach work principle obtained inefficient play selecting moves simulated annealing kirkpatrick game pattern recognition component inherent amenable connectionist methods supervised backpropagation networks applied game face bottleneck hand labelled training data 
propose alternative approach based predictive learning algorithm sutton sutton barto successfully applied game backgammon tesauro tdgammon program backpropagation network features board position output reflecting probability player move trained playing learned evaluation function coupled full lookahead pick estimated move made competitive human players world tesauro early experiment investigated straightforward adaptation approach domain trained fully connected backpropagation network randomized selfplay board standard size humans output learned predict margin black network learn past weak domain program games training found efficiency learning improved appropriately structured network architectures training strategies focus sections unlike backgammon deterministic game generate moves stochas ensure sufficient exploration state space gibbs pling geman geman values obtained search annealing temperature parameter random play temporal difference learning position evaluation game evaluation reinforcement symmetry processed constraint satisfaction feature maps board connectivity symmetry groups figure modular network architecture takes advantage board translation invariance localized reinforcement evaluate positions shown connectivity prediction mechanism discussion network architecture advantages predictive learning richer information game unlike chess backgammon pieces board left generally remain makes final state board informative respect play game scored summing contributions point board make spatial credit assignment accessible network predict point board score evaluate positions bears similarity successor representation dayan integrates vector scalar knowledgebased approach existing programs input features adopt stronger programs order demonstrate reinforcement learning viable alternative conventional approach require networks learn features complexity task significantly reduced exploiting number sharing information network multiple outputs restricts efficient 
implementation note tesauro constraint found optimal schraudolph dayan sejnowski constraints hold priori domain patterns retain properties color reversal reflection rotation board modulo considerable influence board edges translation invariances reflected network architecture color reversal invariance implies changing color stone position player turn move yields equivalent position players perspective build constraint directly networks input values black white squashing functions bias input turn move positions invariant respect reflection rotation symmetry square provided mechanisms constraining network obey invariance weight sharing summing derivatives beneficial evaluation network appears learning account translation invariance convolution weight kernel multiplication weight matrix basic mapping operation network layers feature maps produced scanning fixed receptive field input advantage technique easy transfer learned weight kernels board sizes noted edge board affects local play modulates aspects game forms basis opening strategy account allowing node network bias weight giving degree freedom neighbors enables network encode absolute position modest number adjustable parameters provide additional redundancy board edges selective convolution kernels wide input figure illustrates modular architecture suggested experiments implement features shown connectivity lateral constraint satisfaction subject future work training strategies temporal difference learning network predict consequences strategies basis play produce question arises strategies generate large number games needed training identified criteria compare alternative training strategies computational efficiency move generation quality generated reasonable coverage plausible positions investigating phenomenon temporal difference learning position evaluation game tesauro trained tdgammon selfplay networks position eval training pick players moves technique require external source rules game network teacher 
deterministic game pick estimated move training selfplay running risk ping network suboptimal fixed state happen network playing white predict network playing black advantage changing outcome forcing predictions change practice concern pick moves stochastically gibbs sampling geman geman probability move exponentially related predicted position leads temperature parameter controls degree randomness found selfplay cumbersome reasons search evaluate legal moves computationally intensive investigating faster ways accomplish expect move evaluation remain computational burden learning selfplay network bootstrap benefit exposure training network moves based predictions instance learn playing conventional program observing games human players computer train networks random move generator program commercial program faces random move generator naturally doesnt play advantages high speed ergodicity thousand games random proved effective prime networks start training conventional programs contrast slow deterministic suitable generators training data playing make good network provide required variety play gibbs sampler training games played dissimilar players match strength prevent trivial predictions outcome faces standard purpose modified play random moves proportion random moves reduced adaptively network improves providing online performance measure cases strategies players predictions expect correct network playing real opponent problem strategy choosing moves learning policy adopted optimal network play samuel found program learn games opponent predictions reflect poor good play form overfitting network learn predict strategy detail play general order ensure minimum stability fill eyes locally recognizable type move schraudolph dayan sejnowski architecture figure small network learned play boxes architecture panel represent layers units turn single bias unit arrows weight kernels black represent white inhibitory weights matrix disk area proportional weight magnitude 
results exploring domain trained networks variety methods small sample network learned beat faces playing level games training shown figure network grown training adding hidden layers time trained reflection rotation symmetry constraint weight kernels learned approximately symmetric features direct projection board reinforcement layer interesting structure negative central weight positive surround stems fact stone loses point nearby areas note wide projections hidden layers considerable trick network incorporate edge effects prominent bias projections turn unit compared training architecture selfplay versus play initial rate learning similar starts outperform measured faces demonstrating advantage opponent games starts overfit faces switching training faces point produced games network reliably beat opponent capable selfplay network manage edge past games compares favorably temporal difference learning position evaluation game network introduction verified weights learned offer suitable basis training board discussion general networks opening game suggests reinforcement information gating back final position hard network capture situations complex character strengths weaknesses partially complement symbolic systems suggesting hybrid approaches plan improve network performance number ways augment input representation network task fully intend adding extra input layer nodes active points board inactive occupied color explicit representation makes states point board black stone white stone empty linearly separable network eliminates special treatment board edges limited receptive field sizes raises problem account spatial interactions board distance groups interact function arrangement context important problem position evaluation compute connectivity groups intend model connectivity explicitly training network predict correlation pattern local reinforcement position information control lateral propagation local features hidden layer constraint satisfaction mechanism 
train networks recorded games human players internet server steady quantities format beginning explore promising supply instantaneous play training main obstacle encountered human practice game players agree outcome typically position scored reached address issue eliminating early training bring remaining games completion shown sufficient attention network architecture training procedures connectionist system trained temporal difference learning achieve significant levels performance domain acknowledgements grateful patrice simard tesauro helpful discussions casey game records internet server geoff hinton support provided mcdonnellpew center cognitive neuroscience nserc howard hughes medical institute schraudolph dayan sejnowski references barto sutton anderson neuronlike adaptive elements solve difficult learning control problems ieee transactions systems cybernetics monte carlo manuscript internet file transfer file dayan improving generalization temporal difference learning successor representation neural computation program technical report carnegie mellon university report internet anonymous file transfer file knowledge representation faces manuscript internet anonymous file transfer file geman geman stochastic relaxation gibbs distributions bayesian restoration images ieee transactions pattern analysis machine intelligence kirkpatrick vecchi optimization simulated annealing science boser denker henderson howard hubbard jackel backpropagation applied handwritten code recognition neural computation playing program program internet anonymous file transfer file ally rivest press forthcoming talk computational learning theory natural learning systems versus silicon matching tdgammon inside backgammon samuel studies machine learning game journal research development machine learning applied masters thesis case university internet anonymous file transfer file sutton temporal credit assignment reinforcement learning thesis university massachusetts amherst sutton 
learning predict methods temporal differences machine learning tesauro practical issues temporal difference learning machine learning tesauro tdgammon backgammon program achieves play neural computation
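The stochastic move selection described in the training strategies discussion (moves picked by Gibbs sampling, with probability exponentially related to the predicted value of the resulting position and a temperature parameter controlling the degree of randomness) can be sketched as below. This is an illustrative sketch only: the function names are invented, and `values` is assumed to be a list of network evaluations, one per legal move, which the text does not specify in this form.

```python
import math
import random


def gibbs_move_probabilities(values, temperature):
    """Probabilities proportional to exp(value / temperature)."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    best = max(values)
    # Subtracting the maximum leaves the ratios unchanged and avoids overflow.
    weights = [math.exp((v - best) / temperature) for v in values]
    total = sum(weights)
    return [w / total for w in weights]


def sample_move(values, temperature, rng=random):
    """Draw the index of one move according to the Gibbs probabilities."""
    probs = gibbs_move_probabilities(values, temperature)
    return rng.choices(range(len(values)), weights=probs, k=1)[0]


if __name__ == "__main__":
    predicted = [1.0, 2.0, 3.0]   # hypothetical evaluations of three legal moves
    print(gibbs_move_probabilities(predicted, temperature=1.0))
    print(sample_move(predicted, temperature=1.0, rng=random.Random(0)))
```

A high temperature approaches uniformly random play, while a temperature near zero approaches always picking the best estimated move, which matches the annealing from random play mentioned in the text.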
1 applications backpropagation phonetic classification hong leung spoken language systems group laboratory computer science massachusetts institute technology cambridge abstract paper error backpropagation phonetic classification objective investigate characteristics backpropagation study frame work multilayer perceptrons exploited phonetic recog nition explore issues integration heterogeneous sources information affect performance phonetic classification internal representations comparisons traditional pattern classification techniques comparisons differ error metrics initialization network tion performed experiments attempts vowels american english independent speaker results comparable human performance early approaches phonetic recognition fall major extremes heuristic algorithmic approaches shortcomings heuristic approach intuitive appeal focuses linguistic informa tion speech signal exploits knowledge weak control strategy utilizing knowledge inadequate extreme algorithmic approach relies primarily powerful trol strategy offered pattern recognition techniques speech knowledge accumulated past decades incorporated algorithms feel artificial neural networks characteristics potentially enable bridge extremes hand speech knowledge provide guidance structure design network hand selforganizing mechanism provide control strategy utilizing knowledge paper extend earlier work artificial neural networks phonetic recognition specifically focus investigation sets issues describe network integrate heterogeneous sources information classification performance improves error backpropagation phonetic classification information discuss important factors affect performance phonetic classification examine internal representation network fourth compare network traditional classification techniques knearest neighbor gaussian classifica tion finally discuss specific implementations backpropagation yield improved performance efficient learning time experiments investigation 
performed context experiments attempts recognize vowels american english independent speaker vowels continuous speech preceded phonemes providing rich environment study contextual influence assume locations vowels detected time region network determines vowels spoken corpus table shows training consists vowel tokens continuous sentences spoken male female speakers test consists vowel tokens sentences spoken ferent speakers data extracted timit database wide range american variations speech signal represented spectral vectors obtained auditory model speaker energy performed tokens sentences speakers training testing table corpus extracted timit database network structure structure network examined extensively hidden layer shown figure output units unit vowels order capture dynamic information vowel region divided equal average spectrum computed average spectra applied sets input units additional sources information duration local phonetic contexts made network spectral inputs continuous numerical contextual inputs discrete symbolic leung context context output auditory model synchrony spectrogram duration figure basic structure network heterogeneous information integration earlier study examined integration synchrony phonetic contexts synchrony output auditory model shown enhance formant information study additional sources information figure shows performance heterogeneous sources information made network performance synchrony performance improves rate response output auditory model shown enhance temporal aspects speech signal performance improves contextual inputs provided network experiment suggests network make heterogeneous sources information numerical andor symbolic error backpropagation phonetic classification human recognize vowels experiments performed study human agree sequences phonemes phoneme vowel vowel phoneme vowel results average agreement identities vowels synchrony rate duration phonetic response context sources information figure integration 
heterogeneous sources information performance results important factors network performance amount information network gain additional insights network performs conditions experiments conducted databases subsequent experiments describe paper synchrony network table shows performance results recognition tasks tasks network trained tested independent sets speech data task recognizes vowels spoken speaker spoken isolation recognition task straightforward resulting perfect performance experiment vowel tokens extracted phonetic context spoken male female speakers variability accuracy degrades task recognizes vowels spoken speaker restricted context spoken continuously accuracy decreases finally data timit database spoken multi speakers accuracy drops results substantial difference performance expected conditions depending task speakerindependent restriction phonetic leung context training percent remark tokens correct isolated isolated continuous continuous table performance tasks synchrony spectral infor mation stands phonetic contexts contexts speech material spoken continuously data train network internal representation understand network makes input information exam connection weights network vector formed extracting connections hidden units output unit shown figure process repeated output units obtain total vectors correlations vectors examined measuring angles figure shows distribution angles network trained function number hidden units circles represent distribution vertical bars stand standard deviation number hidden units increases distri bution concentrated vectors increasingly orthogonal correlations connection weights training examined shown figure comparing parts figure distributions training overlap number hidden units increases hidden units distributions similar leads connection weights hidden output layer trained sufficient number hidden units figure shows performance recognizing vowels differ techniques train connections network connections hidden 
output layers random initialization train connections input hidden layers connections input hidden layers train connections hidden output layers hidden units training connections input hidden layers achieves performance training connections network error backpropagation phonetic classification number hidden units training connections input hidden layer achieve higher performance training connections hidden output layer figure compares training techniques vowels resulting output units similar characteristics parts figure output layer hidden layer layer number hidden units number hidden units figure correlations vectors hidden output layers examined distribution angles vectors training distribution angles vectors training comparisons traditional techniques appealing characteristics backpropagation probability distributions distance metrics gain insights compare traditional pattern classification techniques knearest neighbor multidimensional gaussian classifiers leung number hidden units number hidden units figure performance recognizing vowels vowels connections network trained connections input hidden layers trained connections hidden output layers trained figure compares performance results network amounts training tokens synchrony made network resulting input vectors dimensions cluster crosses corresponds performance results networks randomly initialized differently initialization fluctuation observed training size comparison perform euclidean distance metric training size times chosen proportional square root number training tokens simplicity figure shows results values experiment found performance worst found training tokens network consistently compares favorably network find distance metric achieve performance true underlying probability distribution unknown assume multi dimensional gaussian distribution experiment full covariance matrix elements avoid problems singularity obtain results large number training tokens diagonal covariance matrix nonzero elements 
diagonal figure network compares favorably gaussian classifiers results suggest gaussian assumption error backpropagation phonetic classification number training tokens number training tokens figure comparison values text comparison gaussian classification full covariance matrix diagonal covariance matrix cluster crosses corresponds results networks randomly initialized error metric initialization order account classification performance network explicitly introduced weighted square error metric square error weighting factors depend classifica tion performance shown rank order statistics improved simulated annealing gradient descent takes steps formance poor takes smaller smaller steps performance network improves results unit output initially saturation regions sigmoid function network randomly initialized desirable learning slow unit output saturation region sigmoid function connection weights input hidden layers initialized weights hidden unit outputs network initially turn results output values output units words units initially operate center transition region sigmoid function learning fastest call method center initialization parts figure compare learning speed performance techniques square error weighted square error center initialization effective improving learning time performance network leung number training iterations number training tokens figure comparisons learning characteristics performance results techniques point corresponds average networks initialized randomly summary experiments designed understanding backpropagation phonetic classifica tion results encouraging artificial neural networks provide effective framework utilizing knowledge speech recognition references fisher darpa speech recognition research database specifications status proceed ings darpa speech recognition workshop report february leung phonetic recognition experiments artificial neural nets phillips speaker independent classification vowels continuous speech proc 
international congress phonetic sciences seneff computational model peripheral auditory system application speech recognition research proc icassp tokyo seneff vowel recognition based derived auditory based spectral representation proc international congress phonetic sciences
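The internal representation analysis in the phonetic classification paper above forms one weight vector per output unit from its connections to the hidden units, then measures correlations between those vectors as angles. A minimal sketch of that measurement follows. The function names and the toy weights are illustrative assumptions, not taken from the paper.

```python
import math

def angle_degrees(u, v):
    """Angle between two weight vectors, in degrees."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    # Clamp to [-1, 1] to guard against rounding before acos.
    cosine = max(-1.0, min(1.0, dot / (norm_u * norm_v)))
    return math.degrees(math.acos(cosine))

def pairwise_angles(weights):
    """weights[k] holds the hidden-to-output weights of output unit k
    (hypothetical layout). Returns the angle for every pair of units."""
    angles = []
    for i in range(len(weights)):
        for j in range(i + 1, len(weights)):
            angles.append(angle_degrees(weights[i], weights[j]))
    return angles

def mean_and_std(values):
    """Mean and standard deviation of the angle distribution."""
    mean = sum(values) / len(values)
    variance = sum((x - mean) ** 2 for x in values) / len(values)
    return mean, math.sqrt(variance)

# Toy example: three output units, two hidden units.
toy_weights = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
toy_angles = pairwise_angles(toy_weights)
toy_mean, toy_std = mean_and_std(toy_angles)
```

Angles concentrating near 90 degrees correspond to the increasingly orthogonal vectors the paper reports as the number of hidden units grows.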
11 adaboost klaus müller berlin germany abstract boosting methods maximize hard classification margin powerful techniques exhibit overfitting noise cases noisy data boosting enforce hard margin give weight outliers leads dilemma nonsmooth fits overfitting propose algorithms soft margin classification introducing regularization slack variables boosting concept regularized versions linear quadratic programming adaboost experiments show usefulness proposed algorithms comparison soft margin classifier support vector machine introduction boosting ensemble methods success applications noise cases lines explanation proposed candidates explaining functioning boosting methods breiman proposed boosting bagging effect takes place reduces variance effectively limits capacity system freund show boosting classifies large margins error function boosting written function margin boosting step minimize function maximizing margin recently studies noisy patterns shown boosting overfit noisy data holds boosted decision trees nets kinds classifiers boosting methods overfit fact boosting maximize margin argument understand boosting necessarily overfit noisy patterns overlapping distributions give asymptotic arguments statement section hard margin smallest margin plays central role causing overfitting propose relax hard margin classification soft margin classifier concept applied support vector machines successfully permanent address communication information research tokyo japan view margin concept central understanding support vector machines boosting methods clear optimal margin distribution learner achieve optimal classification noisy case data noise hard margin choice noisy data tradeoff data data point outlier general neural network learning strategies leads introduction regularization reflects prior problem introduce strategy analogous weight decay boosting strategy slack variables achieve soft margin section numerical experiments show validity regularization approach section finally
conclusion adaboost algorithm ensemble hypotheses defined input vector weights satisfying binary classification case output class labels ensemble generates label weighted majority votes order train ensemble hypotheses algorithms proposed bagging weighting simply weighting scheme give description special form arcing equivalent adaboost binary classi fication case define margin inputoutput pair correct class predicted margin positive margin increases decision correctness larger adaboost maximizes margin asymptotically minimizing function margin starting note unnormalized weighting hypothesis simply normalized version order find hypothesis learning examples weighted iteration bootstrap weighted sample train alternatively weighted error function weighted weights computed training error computed hypothesis find weight optimize parameter line search direct computing weights equivalent update rule adaboost directly analytic minimization interestingly write gradient respect margins weighted minimization give hypothesis approximation hypothesis obtained minimizing directly note weighted minimization bootstrap weighted necessarily give minimized adaboost approximate gradient descent method minimizes asymptotically hard margins decrease predominantly achieved improvements margin margin negative error takes additionally amplified adaboost decrease negative margin efficiently improve error asymptotic case number iterations large values case values small differences differ ences amplified strongly function sensitive small differences margins margins training patterns margin area boundary area classes asymptotically converge takes adaboost learning hard competition case pattern smallest margin high weights patterns effectively neglected learning process order confirm reasoning correct shows margin distributions adaboost tions noise levels generated uniform distribution left figure apparent margin distribution asymptotically makes step fixed size margin training patterns margin 
area previous studies observed patterns exhibit large overlap support vectors support vector machines results support theoretical asymptotic analysis property adaboost produce margin area pattern area hard margin lead generalization ability true stability figure margin distributions adaboost left noise levels fixed number base typical overfitting behaviour generalization error function number iterations middle typical decision line generated adaboost networks case noise centers smoothed adaboost training patterns classification input noise experiments noisy data observed adaboost made overfitting high number boosting iterations middle shows typical overfitting behaviour generalization error adaboost boosting iterations eralization performance achieved quinlan grove observed overfitting generalization performance adaboost worse single classifier data classification noise reason overfitting increasing noisy patterns labelled asymptotically unlimited influence decision line lead overfitting reason classification hard margin means training patterns asymptotically correctly classified capacity limitation presence noise concept decision line bayes give training error large hard margins noisy data produce hypotheses complex problem soft margins changing error function order avoid overfitting slack variables similar support vector algorithm adaboost training patterns nonnegative minimum margin patterns fact adaboost produces high weights difficult training patterns enforcing nonnegative margin pattern including outliers property eventually lead overfitting observed introduce variables slack variables inequalities positive training pattern high weights previous iterations increasing force outliers classified possibly wrong labels errors sense tradeoff margin importance pattern training process depending constant choose original adaboost algorithm retrieved chosen high data adopt prior weights large weights analogy weight decay choose cumulative weight pattern previous 
iterations call influence pattern similar lagrange multipliers svms adaboost changed easy patterns changed difficult patterns derive error function error function control tradeoff weights pattern iterations achieved margin weight pattern computed derivative subject table description algorithms dataset hypotheses weights construct loss matrix minimize minimize minimize update rule weight training pattern difficult compute weight hypothesis analytically line search procedure unique solution satisfied line search implemented efficiently realvalued outputs base hypotheses original adaboost algorithm optimizing ensemble grove shown linear programming maximize minimum margin ensemble proposed table left algorithm maximizes mini margin training achieves hard margin adaboost asymptotically small number iterations reasoning hard margin section generalize introduce slack variables algorithm table middle modification patterns lower margins lower tradeoff make margins bigger maximize tradeoff controlled constant formulation optimization problem derived support algorithm optimization objective find function minimizes functional form norm parameter vector measure complexity hypothesis ensemble learning measure plexity norm hypotheses weight vector small elements approximately equal analogy bagging high values strongly emphasized hypotheses bagging experimentally found larger complex hypothesis apply optimization principles svms adaboost algorithm table effectively linear results base hypotheses experiments order evaluate performance algorithms make single classifier original adaboost algorithm nets support vector machine kernel artificial real world datasets benchmark dataset breast cancer image ment flare sonar splice waveform problems originally binary classification problems random partition classes generate partitions training test partition train classifier test error performance averaged table adaboost table comparison methods single classifier support vector estimation 
generalization error datasets method bold face performance explanation text cancer image splice waveform winner nets adaptive centers conjugate gradient iterations optimize positions widths centers base hypotheses experiments combined hypotheses number hypotheses optimal adaboost optimal early stopping parameter regularized versions adaboost parameters optimized training datasets training validation find model dataset finally model parameters computed median estimations estimating parameters practice make comparison robust results reliable line shows line computed dataset average error rate classifier types divided minimum error rate subtracted resulting averaged datasets line shows probabilities method wins smallest generalization error basis exper averaged datasets experiments noisy data show results adaboost cases worse single classifier clear overfitting effect results cases adaboost single classifier single classifier wins improves results adaboost wins improves results adaboost cases established soft margin results good results hypotheses generated adaboost aimed construct hard margin generate good soft margin observe quadratic programming slightly results linear programming fact hypotheses generated sparse smaller ensemble bigger ensembles generalization ability reduction variance worse performance compared unexpected explained fixed kernel multiscale information coarse model selection worse error function algorithm noise model adaboost noise cases classes separable shown extends applicability boosting difficult separable cases applied data noisy parameters nearoptimal values parameter tested conclusion introduced algorithms overfitting problems boosting gorithms high noise data direct incorporation regularization term error function linear quadratic programming constraints slack variables essence proposal introduce slack variables regularization order soft margin classification contrast hard margin classification slack variables basically control trust data 
permitted ignore outliers classification generalization spirit support vector machines tradeoff maximization margin minimization classification errors slack variables experiments showed generalization performance algorithms including support vector machines conjecture unexpected result fact scaling information adaboost limitation balance trust data margin maximization cross optimal margin distribution achieve classifying noisy patterns balance errors margin sizes optimally future works plan establish connections adaboost acknowledgements valuable discussions smola partial funding project grant number acknowledged breast cancer domain obtained university medical centre inst providing data references bishop neural networks pattern recognition breiman bagging predictors machine learning breiman arcing classifiers tech berkeley stat dept breiman prediction games arcing algorithms tech berkeley cortes vapnik support vector network mach learn schapire singer improved boosting algorithms predictions proc grove boosting limit maximizing margin learned ensembles proc conf lecun learning algorithms classification handwritten digit neural networks pages müller asymptotic analysis adaboost binary classification case proc april quinlan boosting firstorder learning proc workshop algorithmic learning theory springer soft margins adaboost august royal college technical report submitted machine learning schapire freund bartlett boosting margin effectiveness voting methods mach learn bengio neural networks application line character recognition springer vapnik nature statistical learning theory springer
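The adaboost paper above describes reweighting training patterns at each iteration, weighting each hypothesis by its training error, and the resulting margin of each input-output pair. The sketch below illustrates the standard unregularized adaboost for binary labels in {-1, +1}. It uses one-dimensional threshold stumps as base hypotheses, which is an assumption made here for brevity, since the paper's experiments use other base learners. The soft margin variants proposed in the paper are not implemented.

```python
import math

def stump_predict(threshold, sign, x):
    """Base hypothesis: predicts sign above the threshold, -sign otherwise."""
    return sign if x > threshold else -sign

def train_stump(xs, ys, w):
    """Pick the (threshold, sign) pair with the smallest weighted error."""
    points = sorted(set(xs))
    thresholds = [points[0] - 1.0] + [(a + b) / 2.0
                                      for a, b in zip(points, points[1:])]
    best = None
    for t in thresholds:
        for sign in (1, -1):
            err = sum(wi for xi, yi, wi in zip(xs, ys, w)
                      if stump_predict(t, sign, xi) != yi)
            if best is None or err < best[0]:
                best = (err, t, sign)
    return best

def adaboost(xs, ys, rounds):
    """Returns a list of (hypothesis weight, threshold, sign) triples."""
    n = len(xs)
    w = [1.0 / n] * n  # pattern weights start uniform
    ensemble = []
    for _ in range(rounds):
        err, t, sign = train_stump(xs, ys, w)
        err = min(max(err, 1e-10), 1.0 - 1e-10)  # keep the log finite
        alpha = 0.5 * math.log((1.0 - err) / err)
        ensemble.append((alpha, t, sign))
        # Misclassified patterns gain weight, correct ones lose weight.
        w = [wi * math.exp(-alpha * yi * stump_predict(t, sign, xi))
             for xi, yi, wi in zip(xs, ys, w)]
        total = sum(w)
        w = [wi / total for wi in w]
    return ensemble

def margins(ensemble, xs, ys):
    """Normalized margin of each pattern; positive means correctly classified."""
    norm = sum(alpha for alpha, _, _ in ensemble)
    return [yi * sum(alpha * stump_predict(t, sign, xi)
                     for alpha, t, sign in ensemble) / norm
            for xi, yi in zip(xs, ys)]

# Toy data that no single stump can separate.
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
ys = [1, 1, -1, -1, 1, 1]
ensemble = adaboost(xs, ys, rounds=3)
toy_margins = margins(ensemble, xs, ys)
```

On this toy set three rounds suffice for every pattern to have a positive margin. With mislabeled patterns the same update keeps raising their weights, which is the hard margin behaviour the paper identifies as the source of overfitting.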
8 bound error cross validation approximation estimation rates consequences split michael kearns research abstract introduction analyze performance cross validation context model selection complexity regularization work setting choose number parameters hypothesis function response finite training sample goal minimizing resulting generalization error large interesting literature cross validation methods emphasizes asymptotic statistical properties exact calculation generalization error simple models approach primarily inspired sources work barron cover introduced idea bounding error model selection method case minimum description length principle terms quantity index work vapnik provided extremely powerful general tools uniformly bounding deviations training generalization errors combine methods give general analysis cross validation perfor mance formal part paper give rigorous bound error cross validation terms parameters underlying model selection problem approximation rate estimation rate experimental part paper investigate implications bound choosing fraction data testing cross validation interesting aspect analysis identification qualitative properties optimal invariant wide class model selection problems target function complexity small compared sample size performance cross validation insensitive choice importance choosing optimally increases optimal decreases target function complex relative sample size single fixed works optimally wide range target function complexity formalism model selection problem choosing number parameters hypothesis function tuning parameters training sample steps process settings tuning parameters determined fixed learning algorithm backpropagation model selection reduces problem choosing architecture adopt idealized version division assume nested sequence function classes called structure class boolean functions parameters conflict accepted usage statistics term cross validation simple method saving independent test perform model 
selection precise definitions stated shortly kearns function mapping input space simplicity paper assume vapnikchervonenkis dimension class remove assumption simply replaces occurrences bounds dimension assume learning algorithm input training sample output hypothesis function minimizes training error fraction examples label situations training error minimization computationally intractable leading researchers investigate heuristics backpropagation extent theory presented applies heuristics depend part extent approximate training error minimization problem consideration model selection problem choosing precisely assume arbitrary target function function classes structure input distribution define generalization error function training sample consisting random examples drawn labeled labels possibly corrupted noise process randomly label independently probability goal minimize generalization error hypothesis selected paper make mild assumption structure property sample size labeled sample examples call function fitting number structure fitting number simple notion parameters training data perfectly property held sufficiently powerful function classes including multilayer neural networks typically expect fitting number linear function worst polynomial significance fitting number reasonable model selection method choose simply adds complexity reducing training error paper concentrate simplest version cross validation choose parameter determines split training test data input sample examples subsample consisting examples subsample consisting cross validation giving entire sample give smaller sample resulting sequence increasingly complex hypotheses hypothesis obtained training examples implies values smaller fitting number introduce cross validation chooses satisfying error subsample notice cross validation variants make efficient sample analyses require independence test emerge apply sophisticated variants denote generalization error hypothesis chosen cross validation 
input sample random examples target function depends structure noise rate bounding expression high probability probability sample small fixed constant results stated parameter cost factor bounds terms expected approximation rate apparent nontrivial bound account measure complexity unknown target function correct measure complexity obvious barron covers analysis performance bound error cross validation context density estimation propose approximation rate natural measure complexity relation chosen structure define approximation rate function function tells generalization error achieved class nonincreasing function sufficiently large means target function respect input distribution realizable class coarse measure complex generally rate decay nice indication representational power gain respect increasing complexity models missing means determining extent representational power realized training finite sample size added shortly give examples approximation rate examine general bound intervals problem problem input space real interval class structure consists boolean step functions steps function partitions interval disjoint segments necessarily equal width assigns alternating positive negative labels segments input space onedimensional structure arbitrarily complex functions easily verified assumption dimension holds fitting number obeys suppose input density uniform suppose target function function alternating segments equal width lies class refer settings intervals problem approximation rate figure perceptron problem problem input space large natural number class consists perceptrons inputs weights nonzero input density symmetric instance uniform density unit ball target function function nonzero weights equal shown approximation rate figure power decay addition specific examples study natural parametric forms determine sensitivity theory plausible range behaviors approximation rate important practice expect precise knowledge depends target function input distribution 
work barron shows bound case neural networks hidden layer squared error generalization measure measure target function complexity terms fourier transform condition approximation rates form parameter representing degree respect structure parameters capturing rate decay figure estimation rate fixed function estimation rate bound high probability sample usual result training error minimization simply bounds deviation training error generalization error note bound depend complicated elements problem structure recent work statistical physics theory learning curves wide variety behaviors deviations assume natural problems bounds give straightforward generalizations realvalued function learn squared error examining behavior setting reasonable kearns convenient accurate rely universal estimation rate bound provided powerful theory uniform convergence structure function estimation rate bound depending details problem omit factor refine behavior function smoothly behavior small large interesting important qualitative claims predictions make invariant long deviation power important recognize model cases power behavior violated note universal estimation rate bound holds assumption training sample noisefree straightforward generalizations exist instance training data corrupted random label noise rate universal estimation rate bound bound theorem structure dimension target function input distribution proximation rate function structure respect estimation rate bound structure respect high probability fraction training sample testing fitting number universal estimation bound rate weak assumption polynomial obtain high probability straightforward generalizations bounds case data corrupted classification noise obtained modified estimation rate bound section delay proof theorem full paper space considerations central idea appeal uniform convergence arguments class bound generalization error resulting training error minimizer time bound generalization error minimizing error test 
examples bounds expression analogous barron covers index final term bounds represents error introduced testing phase cross validation bounds exhibit tradeoff behavior respect parameter approach sample training estimation rate bound term decreasing test error term increasing data accurately estimate reverse phenomenon occurs approach theorem potentially interpretation step precisely suppose main effect classification noise rate replacement occurrences bound sample size smaller effective sample size bound error cross validation assume bound approximation actual behavior principle optimize bound obtain addition assumptions involved main good approximation error deviations analysis carried information expect practice exact form approximation rate function depends argue coming sections interesting qualitative phenomena choice largely invariant wide range natural behaviors case study intervals problem begin performing suggested optimization intervals problem recall approximation rate complexity target function analyze behavior obtained assuming estimation rate behaves omitting factor universal bound simplify formal analysis changing qualitative behavior replace term weaker define function equation approximating step analysis differentiate respect discover minimizing step differentiate respect shown details omitted optimal choice assumptions important remember point fact derived precise expression assumptions approximations made constants quantitative interpretation expression meaningless expect expression captures qualitative optimal amount data relation target function complexity score situation initially appears function sensitive ratio expect knowing practice interesting entire story figure plot function function values note consistency experimental plots axis plot training fraction observe important qualitative phenomena list order increasing small compared predicted error insensitive choice function wide fiat indicating wide range yielding essentially nearoptimal 
error larger comparison fixed sample size relative superiority values pronounced large values progressively worse increases plots choice result error achieved predicted yield greatly suboptimal error note large bound predicts large error values choice irrelevant small compared yield good performance wide range values essentially case large nontrivial generalization choosing important small case decreasing increases slightly difficult confirm plot precise expression hidden constants notation bounds relative weights estimation test error terms important choosing constants equal reasonable choice terms bound kearns figure plot results experiments labeled random samples size generated target function equal width intervals samples corrupted random label noise rate sample program performing training error minimization remaining examples select cross validation plots show true generalization error selected cross validation function generalization error computed problem point plots represents average trials obvious significant quantitative differences experimental plots theoretical predictions figure properties data figure small compared wide range acceptable appears choice yields optimal generalization error time sensitivity considerably pronounced choice results suboptimal performance important close complexities single approximately performs optimally entire range examined property optimal decreases target function complexity increased relative fixed experimental results effect simply small verified interesting verify prediction experimentally problem predicted effect pronounced conclusions cases approximation rate obeys power decay derived perceptron problem discussed section behavior function predicted theory largely figure full paper describe realistic experiments cross validation determine number training epochs figures similar figures obtained rough accordance theory summary theory predicts significant quantitative differences behavior cross validation arise model 
selection problems properties present wide range problems behavior bounds exhibits properties wide range problems interesting identify natural problems properties strongly violated potential source problems underlying learning curve classical power behavior acknowledgements give mansour andrew dana conversations cross validation model selection additional andrew conducting experiments references barron universal approximation bounds superpositions function ieee transactions information theory barron cover minimum complexity density estimation ieee transactions information theory haussler seung tishby learning curve bounds statistical mechanics proceedings seventh annual computational learning theory pages seung sompolinsky tishby statistical mechanics learning examples vapnik estimation dependences based empirical data springerverlag york vapnik chervonenkis uniform convergence relative frequencies events probabilities theory probability applications bound error cross validation approximation rates error bound intervals slice train size noise figure plots approximation rates intervals problem target complexity intervals linear plot intersecting perceptron problem target plexity nonzero weights nonlinear plot intersecting power figure plot predicted generalization error cross validation intervals model selection problem function fraction data training plot fraction training data left fixed sample size plots show error predicted theory target function complexity values bottom plot plot figure experimental plots cross validation generalization error intervals problem function training size experiments target complexity values bottom plot plot shown point represents performance averaged bound figure plot predicted generalization error cross validation power case function fraction data training fixed sample size plots show error predicted theory target function complexity values bottom plot plot
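The cross validation paper above studies the simplest hold-out scheme: a fraction of the sample is reserved for testing, a training error minimizer is run for each class in a nested structure on the remaining examples, and the hypothesis with the smallest test error is selected. The sketch below illustrates that procedure under simplifying assumptions. The structure used here consists of functions constant on 2^k equal-width bins of the unit interval, which is nested but is not the paper's class of step functions with arbitrary segment widths. All names, including `gamma` for the test fraction, are illustrative.

```python
import random

def fit_bins(train, k):
    """Training error minimizer over functions constant on 2**k equal bins:
    a majority vote over the training labels falling in each bin."""
    d = 2 ** k
    votes = [0] * d
    for x, y in train:
        votes[min(int(x * d), d - 1)] += 1 if y == 1 else -1
    return [1 if v >= 0 else 0 for v in votes]

def predict(h, x):
    d = len(h)
    return h[min(int(x * d), d - 1)]

def error(h, sample):
    """Fraction of examples on which h disagrees with the label."""
    return sum(1 for x, y in sample if predict(h, x) != y) / len(sample)

def cross_validate(sample, gamma, max_k):
    """Hold out the last gamma fraction for testing, train on the rest,
    and return the complexity index and hypothesis with least test error."""
    n_test = max(1, int(gamma * len(sample)))
    train, test = sample[:-n_test], sample[-n_test:]
    best_k, best_h, best_err = None, None, None
    for k in range(max_k + 1):
        h = fit_bins(train, k)
        e = error(h, test)
        if best_err is None or e < best_err:
            best_k, best_h, best_err = k, h, e
    return best_k, best_h

def make_sample(m, s, noise, rng):
    """Target with s equal-width intervals and alternating labels;
    each label is flipped independently with probability noise."""
    sample = []
    for _ in range(m):
        x = rng.random()
        y = int(x * s) % 2
        if rng.random() < noise:
            y = 1 - y
        sample.append((x, y))
    return sample

rng = random.Random(0)
clean = make_sample(200, 4, 0.0, rng)
chosen_k, chosen_h = cross_validate(clean, gamma=0.2, max_k=6)
```

Rerunning `cross_validate` with different values of `gamma`, target complexity `s`, and `noise` gives a rough way to explore the sensitivity to the test fraction that the paper discusses. This sketch makes no claim to reproduce the paper's plots.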
6 adaptive knot placement nonparametric regression department computer science university wisconsin falls department electrical engineering university abstract performance nonparametric methods critically depends strategy positioning knots regression surface constrained topological mapping algorithm method achieves adaptive knot placement neural network based kohonens selforganizing maps present modification original algorithm knot placement estimated derivative regression surface introduction regression problems mathematical notation seek find function predictor variables denoted vector data points measurements dimensional sample space error error unknown distribution depend distribution points training arbitrary uniform distribution domain responsible correspondence telephone email goal paper show statistical considerations improve performance neural network algorithm regression order achieve adaptive positioning knots regression surface estimating employing derivative underlying function modified algorithm made flexible regions large derivative empirical investigation show modified algorithm units regions derivative large increase local knot density introduces flexibility model regions large derivative makes model biased regions overfitting observed regions problem knot location challenging problems practical implementations adaptive methods regression adaptive positioning knots regression surface typically knot positions domain chosen subset training data knots uniformly distributed fixed commonly datadriven methods applied determine number knots showed polynomial spline spaced knots approximate arbitrary function spline equally spaced knots minimization problem involved determination optimal placement knots highly nonlinear solution space convex performance recent algorithms include adaptive knot placement mars difficult evaluate analytically addition wellknown data points uniform knots located derivative function large difficult extend results nonuniform data 
conjunction datadependent noise estimating derivative true function optimal knot placement function estimation depends good placement knots suggests iterative procedure alternates function knot positioning steps methods effectively solve problem adaptive knot tion strategies statistically optimal local adaptive methods generalization kernel functions kernel centers data adaptive algorithm examples local adaptive methods include recently proposed models radial basis function networks regularization networks networks locally tuned units applied regression problems methods seek regression estimate general form vector predictor variable coordinates center bump response function kernel type kernel width center linear coefficients determined total number knots centers general formulation assumes global optimization error training respect parameters center locations kernel width linear coefficients practically feasible error surface generally nonconvex local minima practical approaches solve problem location assume identical kernel functions remaining problem finding linear coefficients solved familiar methods linear algebra gradientdescent techniques appears problem center locations critical local neural network techniques heuristics center location based statistical considerations empirical results statistical methods knot locations typically viewed free parameters model number knots directly controls model complexity alternatively impose local constraints adjacent knot locations neighboring knots move independently approach effectively implemented model selforganization kohonens selforganizing maps model units knots neighborhood relations units fixed topological structure typically grid training selforganization data points presented iteratively time unit closest data moves topological neighbors modified algorithm adaptive knot placement model applied nonparametric regression order achieve adaptive positioning knots regression
surface technique called constrained topological mapping modification kohonens selforganization suitable regression problems units kohonen knots regression surface correspondingly problem finding regression estimate stated problem forming dimensional topological samples dimensional sample space straightforward application algorithm regression problem work presence noise training data algorithm produce function independent variables regression problem problem overcome algorithm nearest neighbor found subspace predictor variables space present concise description algorithm standard regression training data dimensional vectors noisy observation unknown function predictor variables vector algorithm constructs dimensional topological dimensional sample space initialize dimensional topological dimensional sample space input vector dimensional sample space find closest matching unit subspace independent variables projection input vector subspace independent variables projection weight vector unit discrete time step adjust units weights return learning rate neighborhood unit iteration final time step equal product training size number times initial learning rate final learning rate experiments topological distance unit matched unit initial size number units note method achieves placement units knots density training data fact units training follow standard kohonen selforganization algorithm achieve faithful approximation unknown distribution existing method place knots underlying function rapidly improved strategy knot placement takes account estimated derivative function problem estimating derivative function unknown suggests iterative strategy building model start crude model estimate derivative based crude model estimated derivative refine model strategy easily incorporated algorithm iterative nature specifically method model closer closer final regression model training proceeds iteration modified algorithm estimates derivative matching unit closest presented
data point additional movement knots proportional estimate estimating derivative training data makes sense smoothing properties modified algorithm summarized present training sample find closest matching unit subspace independent variables data point original move matching unit neighbors presented data point original adaptive knot placement nonparametric regression estimate average derivative function matching unit based current positions units normalize average derivative interval move presented data point rate proportional estimated normalizes average derivative iterate multivariate functions gradients directions topological structure estimated step mesh approximates function unit border units neighbor neighboring units topological dimension neighboring units approximate functions gradients topological dimension values dimension averaged provide local gradient estimate knot step estimated average derivative normalized range derivative learning rate step modified equation movement flexibility region derivative large process equation equivalent units data learning rate proportional estimated derivative matched unit note influence derivative gradually increased process selforganization factor factor account fact closer closer underlying selforganization providing reliable estimate derivative empirical comparison performance algorithms original modified compared lowdimensional problems experiments algorithms training data points univariate problems data points problems training samples generated randomly drawn uniform distribution closed interval error drawn normal distribution regression estimates produced selforganized maps tested samples test generated training average residual performance measure test piecewise linear estimate function knot locations provided coordinates units trained aver residual indication standard deviation generalization error true function original modified figure unit formed original modified algorithm gaussian function true function 
original modified figure unit formed original modified algorithm step function gaussian function step function experiments figure show actual maps formed original modified algorithm functions clear figures algorithm units regions derivative large increase local knot density introduced flexibility model regions large derivatives result model biased regions overfitting regions derivative large modified units dimension figure average residual error size dimensional step function original modified units dimension figure average residual error function size dimensional sine function compare behavior algorithms predictability data trained constant function problem smoothing pure noise regression analysis shown original algorithm handles problem quality smoothing independent number units experiments show modified algorithm performs good original respect finally functions step sine modified algorithm performs higher dimensional settings step sine results experiments summarized figure modified algorithm outperforms original algorithm note step function easily handled recursive partition techniques cart recursive methods sensitive coordinate rotation hand method performance independent affine transformation references breiman friedman olshen stone classification regression trees belmont broomhead lowe multivariable functional interpolation adaptive networks complex systems neural networks nonparametric regression kung fallside kamm editors neural networks signal processing volume constrained topological mapping nonparametric regression analysis neural networks practical guide splines springerverlag friedman silverman flexible parsimonious smoothing additive modeling kohonen selforganization associative memory springer verlag edition moody darken fast learning networks locally tuned processing units neural computation poggio girosi networks approximation learning proceedings ieee
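The constrained topological mapping procedure and its derivative-based modification, as described in this paper, can be sketched for the univariate case as follows. This is only an illustration under assumptions: the extracted text has lost the original update equations, so the exponential learning-rate schedule, the Gaussian neighborhood whose width is tied to the learning rate, the slope estimate in `local_slopes`, and the names `train_ctm` and `use_derivative` are all hypothetical choices rather than the authors' exact formulation.

```python
import math
import random


def local_slopes(units):
    """Average absolute slope between each unit and its chain neighbors,
    used here as a crude derivative estimate at that knot."""
    n = len(units)
    slopes = []
    for j in range(n):
        vals = []
        for nb in (j - 1, j + 1):
            if 0 <= nb < n:
                dx = abs(units[nb][0] - units[j][0])
                dy = abs(units[nb][1] - units[j][1])
                vals.append(dy / max(dx, 1e-6))
        slopes.append(sum(vals) / len(vals))
    return slopes


def train_ctm(data, n_units=10, epochs=50, rate_init=1.0, rate_final=0.05,
              use_derivative=False, seed=0):
    """Fit a 1-D chain of knots (x, y) to data points (x, y), x in [0, 1]."""
    rng = random.Random(seed)
    units = [[i / (n_units - 1), 0.0] for i in range(n_units)]
    k_max = epochs * len(data)
    k = 0
    for _ in range(epochs):
        order = data[:]
        rng.shuffle(order)
        for x, y in order:
            k += 1
            frac = k / k_max
            # Assumed schedule: rate decays from rate_init to rate_final.
            rate = rate_init * (rate_final / rate_init) ** frac
            width = max(rate * n_units, 1e-3)
            # The winner is the closest unit in the predictor subspace only.
            win = min(range(n_units), key=lambda j: abs(units[j][0] - x))
            for j in range(n_units):
                a = rate * math.exp(-0.5 * ((j - win) / width) ** 2)
                units[j][0] += a * (x - units[j][0])
                units[j][1] += a * (y - units[j][1])
            if use_derivative:
                # Extra move of the winner, proportional to its normalized
                # slope; the factor frac phases the effect in over training.
                slopes = local_slopes(units)
                top = max(slopes)
                if top > 0:
                    extra = rate * frac * slopes[win] / top
                    units[win][0] += extra * (x - units[win][0])
                    units[win][1] += extra * (y - units[win][1])
    return units


if __name__ == "__main__":
    rng = random.Random(1)
    data = []
    for _ in range(200):
        x = rng.random()
        data.append((x, (1.0 if x > 0.5 else 0.0) + rng.gauss(0.0, 0.05)))
    for flag in (False, True):
        knots = sorted(train_ctm(data, use_derivative=flag))
        print(flag, [round(u[0], 2) for u in knots])
```

Comparing the printed knot positions for the two settings on a step function gives a quick check of whether the extra move concentrates knots near the jump, which is the behavior the paper reports for its modified algorithm.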
11 learning groups invariant visual perception daniel ruderman sloan center theoretical neurobiology salk institute jolla abstract important problems visual perception visual invariance objects perceived undergoing transformations translations rotations scaling paper describe bayesian method learning invariances based group theory show previous approaches based firstorder taylor series expansions inputs regarded special cases group approach handling principle arbitrarily large transformations matrix exponential based generative model images derive unsupervised algorithm learning group operators input data transformations online unsupervised learning algorithm maximizes posterior probability generating training data provide experimental results suggesting proposed method learn group operators handling large translations rotations introduction fundamental problem faced biological machine vision systems recognition familiar objects patterns presence transformations translations rotations scaling importance problem recognized early visual gibson hypothesized constant perception depends ability individual computational pitts mcculloch propose method perceptual invariance knowing number approaches proposed relying temporal sequences input patterns undergoing transformations relying modifications distance metric comparing input images stored templates paper describe bayesian method learning invariances based notion continuous group theory show previous approaches based firstorder taylor series expansions images regarded special cases group approach approaches based firstorder models account small transformations assumption linear generative model transformed images approach hand utilizes based generative model principle handle arbitrarily large transformations correct transformation operators learned bayesian principles derive online unsupervised algorithm learning group operators input data infinitesimal transformations groups previously research supported sloan foundation learning
groups visual perception computer vision image processing question learn groups directly input data remained open experimental results suggest examined cases translations rotations proposed method learn group operators high degree accuracy allowing learned operators vision continuous transformations groups suppose point general vector element space denote transformation point point transformation operator completely actions points space suppose belongs family operators interested cases group exists mapping pairs transformations transformation associative exists unique identity transformation exists unique inverse transformation properties reasonable expect general transformations images continuous transformations made small favor properties concerned continuous transformation groups groups continuity transformation operators group assumed implement continuous mapping concrete suppose parameterized single real number group continuous function continuous image continuous variation results continuous variation equivalent identity transformation transformation arbitrarily close identity effect written order matrix generator transformation group macroscopic transformation produced chaining number infinitesimal transformations dividing parameter equal parts performing transformation turn obtain limit expression reduces matrix exponential equation initial reference input elements group written generator group related derivative suggests alternate deriving equation respect taylor series expansion transformed input terms previous input denotes relative transformation defining operator matrix rewrite equation equation previous approaches based firstorder taylor series expansions viewed special cases group model learning transformation groups goal learn generators transformation groups directly input data examples infinitesimal transformations note learning generator transformation effectively remain invariant transformation assume natural temporal sequences images transformations small
deterministic sets pixel independent ruderman object network transformation figure network architecture interpolation function implementation proposed approach invariant vision involving recurrent networks estimating transformations estimating object features supplies reference image transformation work locally recurrent transformation network implementing equation network computes interpolation function generate training data assuming periodic signals actual pixels universal image tions question address learn group operator simply series images vector image image transformation results previous section write stochastic generative model images assumed zeromean gaussian white noise process variance learn full exponential generative model difficult multiple local minima restrict transformations infinitesimal higher order terms negligible rewrite equation tractable form difference image note model linear gener learned infinitesimal transformations matrix exponential model learned matrix handle larger experimental results suppose image pairs data find matrix trans generated data bayesian maximum posteriori approach gaussian priors negative posterior probability data variance zeromean gaussian prior vector form covariance matrix gaussian prior extending equation learning groups multiple image data accomplished summing datadriven term image pairs assume fixed images transformation vary experiments chosen fixed scalar values speed learning improve accuracy choosing based knowledge expect infinitesimal image transformations define entry function distance pixels entry exploit fact symmetric efficacy choice investigation generator matrix learned unsupervised manner performing gradient descent maximizing posterior probability generating data positive constant governs learning rate matrix form vector learning rule requires current image pair estimate performing gradient descent respect fixed previously learned learning process involves alternating fast estimation image pair
slower adaptation generator matrix figure depicts possible network implementation proposed approach invariant vision implementation reminiscent division dorsal ventral streams primate cortex parallel networks estimating object identity estimating object transformations object network based standard linear generative model form matrix learned object features feature vector object perceptual constancy achieved fact estimate object identity remains stable network network attempts account transformations induced image type transformation induced estimate details estimation rule based firstorder model equation estimating small infinitesimal transformations general rule estimating larger transformations obtaining performing gradient descent optimization function generative model equation figure shows locally recurrent network implementation matrix exponential computation required equation experimental results training data interpolation function purpose evaluating algorithm synthetic training data randomly generated image uniformly pixel intensities transformation image image pixels continuously transform sampled discrete pixel locations infinitesimal amounts employ interpolation function make theorem signal real number uniquely sufficiently close equally spaced discrete samples assuming signal periodic theorem dimension written algebraic manipulation simplification reduced interpolation function ruderman analytical operator real imaginary learned operator real imaginary figure learned operators translations operator matrix operator pixel plot real imaginary parts eigenvalues learned matrix operator plot eigenvalues learned matrix figure shows interpolation function translate infinitesimal amount similarly rotate translate images analog addition generate images transformations interpolation function derive analytical expression operator matrix directly derivative evaluate results learning figure shows matrix infinitesimal translations images bright pixels positive values
dark negative shown rows representing operator centered pixel learning translations figure shows results equation training image pairs learning generator matrix translations images randomly generated image training pair translated left pixels learning rate decreased training pair note expected translations rows learned matrix identical shift differential operator shown figure applied image location comparison learned matrix analytical matrix figure suggests learning algorithm learn good approximation true generator matrix arbitrary multiplicative scaling factor evaluate learned matrix generate arbitrary translations reference image equation results encouraging shown figure noticed tendency appearance artifacts translated images significant highfrequency reference image estimating large transformations learned generator matrix estimate large translations images equation optimization function local minima figure local minima tend shallow approximately unique welldefined global minimum searched global minimum performing gradient descent equally spaced starting values picked minimum estimated values convergence figure shows results estimation process learning rotations tested learning algorithm images image plane rotations training image pairs generated rotating images pixel intensities clockwise counterclockwise learned operator matrix spatial scales shown figure accuracy matrices tested learning groups figure generating estimating large transformations original reference image translated varying degrees learned generator matrix varying equation negative likelihood optimization function generative model equation estimating large translations globally minimum found gradient descent multiple starting points comparison estimated translation values actual values pairs reference translated images shown form table equation rotations shown figure case learned matrix appears rotate reference image initial position larger rotations minor artifacts edges conclusions results
suggest unsupervised network learn visual invariances learning operators generators transformation groups important issue local minima avoided estimation large transformations performing multiple searches possibility coarsetofine techniques transformation estimates obtained coarse scale starting points estimating transformations finer scales possibility stochastic techniques exploit specialized optimization function figure directions research investigating structured priors generator matrix improve learning accuracy speed concurrent effort involves testing approach realistic natural image sequences richer variety transformations references black robust matching tracking articulated objects viewbased representation proc fourth european conference computer vision pages transformation group model visual perception perception psychophysics essen distributed hierarchical processing primate cortex cerebral cortex generative model case multiple transformations generator type transformation transformation input image ruderman initial final figure learned operators rotations initial converged values matrix rotations scales examples arbitrary rotations reference image generated learned operator matrix results shown rotations generated realvalued learning invariance transformation sequences neural computation fukushima neocognitron selforganizing neural network model mechanism pattern recognition unaffected shift position biological cybernetics gibson senses considered perceptual systems boston lecun boser denker henderson howard hubbard jackel backpropagation applied handwritten code recognition neural computation marks introduction shannon sampling interpolation theory york springerverlag signal representation processing operator groups technical report studies science technology department engineering university olshausen anderson essen multiscale dynamic routing circuit forming size object representations journal computational neuroscience olshausen field emergence receptive
field properties learn sparse code natural images nature pitts mcculloch perception auditory visual forms mathematical biophysics ballard dynamic model visual recognition predicts neural response properties visual cortex neural computation ballard development localized oriented receptive fields learning code natural images network computation neural systems simard lecun efficient pattern recognition transformation distance advances neural information processing systems pages mateo morgan kaufmann publishers vision lies approach ance image vision computing
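Two ideas from this paper lend themselves to a small numerical check: a macroscopic transformation obtained by chaining many infinitesimal ones converges to the matrix exponential of the generator, and the generator can be recovered by gradient steps on the first-order model. The sketch below is a toy illustration under assumptions, not the authors' network: it uses 2-D points and the planar rotation generator instead of images, it omits the Gaussian priors, and it normalizes each gradient step by the input energy for stability. The names `mat_exp`, `chained` and `learn_generator` are hypothetical.

```python
import math
import random


def identity(n):
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]


def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def mat_exp(m, terms=30):
    """Matrix exponential by truncated Taylor series (fine for small norms)."""
    n = len(m)
    result = identity(n)
    term = identity(n)
    for k in range(1, terms):
        term = [[v / k for v in row] for row in matmul(term, m)]
        result = [[result[i][j] + term[i][j] for j in range(n)]
                  for i in range(n)]
    return result


def chained(g, z, n):
    """Apply the infinitesimal transformation (I + (z / n) G) n times."""
    dim = len(g)
    step = [[(1.0 if i == j else 0.0) + (z / n) * g[i][j]
             for j in range(dim)] for i in range(dim)]
    result = identity(dim)
    for _ in range(n):
        result = matmul(result, step)
    return result


def learn_generator(pairs, z, mu=0.5):
    """Online estimate of G from pairs (x, x') with x' - x ~= z G x.
    Each step follows the gradient of the squared residual, divided by
    z^2 |x|^2 (an assumed normalization, not taken from the paper)."""
    dim = len(pairs[0][0])
    g = [[0.0] * dim for _ in range(dim)]
    for x, x_next in pairs:
        pred = [z * sum(g[i][j] * x[j] for j in range(dim))
                for i in range(dim)]
        err = [x_next[i] - x[i] - pred[i] for i in range(dim)]
        norm = z * z * sum(v * v for v in x)
        for i in range(dim):
            for j in range(dim):
                g[i][j] += mu * err[i] * z * x[j] / norm
    return g


if __name__ == "__main__":
    G = [[0.0, -1.0], [1.0, 0.0]]  # generator of planar rotations
    theta = math.pi / 3
    print(mat_exp([[theta * v for v in row] for row in G]))
    print(chained(G, theta, 10000))
```

For images the generator would be a pixels-by-pixels matrix and the same update would run over image pairs. The paper additionally alternates this slow update with a fast estimate of the transformation amount, which the sketch treats as known.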
7 analysis contributions cross connected networks thomas shultz department psychology university canada abstract understanding knowledge representations neural nets difficult problem principal components analysis contributions products sending activations connection weights yielded valuable insights knowledge representations work focused correlation matrix contributions present work shows analyzing matrix contributions yields valid insights taking account weights introduction knowledge representations learned neural networks difficult understand nonlinear properties nets fact knowledge distributed units standard network analysis techniques based network connection weights hidden unit activations limited weight diagrams typically complex weights vary multiple networks trained problem analysis activation patterns hidden units limited nets single layer hidden units cross connections cross connections direct connections bypass intervening hidden unit layers increase learning speed static networks focusing linear relations lang witbrock standard feature generative algorithms cascade correlation fahlman lebiere cross connections work analyses restricted hidden unit activations partial picture networks knowledge contribution analysis shown technique multilayer cross connected nets sanger defined contribution product output weight activation sending unit sign output target input contributions potentially informative weights hidden unit activations account weight sending activation shultz elman reduce dimensionality contributions types cascadecorrelation nets shultz demonstrated contributions produced insights cascadecorrelation solutions comparable analyses contributions scaled sign output targets sanger scaling contributions signs output targets order determine contributions helped networks solution signs output targets networks error thomas shultz correction learning natural contributions analyzing knowledge representations issue correlation matrix variance covariance 
matrix correlation matrix diagonal pearson correlation coefficients contributions diagonal effect variables contributions standard deviation effectively ensures correlation matrix exploits variation input activation patterns ignores variation connection weights variation connection weights eliminated contributions standardized report work investigates insights network knowledge structures revealed contributions apply matrix contributions matrix contribution variances diagonal covariances contributions diagonal taking explicit account variation connection weights produce valid picture networks knowledge networks problems employed earlier work shultz elman shultz facilitate comparison results problems include continuous arithmetic comparisons involving addition multiplication distinguishing spirals nets generated cascadecorrelation algorithm fahlman lebiere cascadecorrelation begins perceptron hidden units network order reduce error hidden unit activations correlate networks current error units cascade separate layer receiving input input units previously existing hidden units default values cascadecorrelation parameters goal understanding knowledge representations learned networks variety contexts context cognitive modeling ability nets simulate psychological phenomena sufficient addition important determine network representations bear systematic relation representations employed human subjects contributions original contribution analysis began threedimensional array contributions output unit hidden unit input pattern contrast start dimensional output weight input pattern array contributions efficient technique sanger focus output hidden units identification roles specific contributions shultz elman shultz subject matrix contributions order identify main dimensions variation contributions component line data points multidimensional space goal summarize multivariate data small number components covariance variables case contributions test determine components 
include analysis rotation applied improve solution component scores plotted identify function component application continuous classical binary problem training patterns make contribution analysis worthwhile constructed continuous version problem dividing input space quadrants starting input values incremented steps producing input pairs partitioned quadrants input space quadrant values analysis contributions cross connected combined values quadrant values greater quadrant values quadrant values greater combined values similar binary problems quadrants positive output target problems quadrants negative output target single output unit sigmoid activation cascadecorrelation nets trained continuous nets generated unique solution hidden units taking epochs learn correctly classify input patterns generalization test patterns training excellent contributions yielded components plot rotated component scores training patterns shown figure component scores labeled respective quadrant input space components required account variance contributions figure shows component variance contributions role distinguishing quadrants positive output target negative output target fact black shapes component space cube figure white shapes bottom components represent variation input dimensions component accounted variance contributions component accounted variance contributions input pairs quadrants square shapes concentrated negative component input pairs quadrants circle shapes concentrated positive component similarly input pairs quadrants cluster negative component input pairs quadrants cluster positive component network explicitly trained represent input dimensions feature learning distinction quadrants quadrants similar results obtained nets learning continuous problem contrast correlation matrix nets yielded clear picture component separating quadrants quadrants components representing variation input dimensions shultz correlation matrix scaled contributions performed worse plots 
component scores indicating interactive separation quadrants clear roles individual components shultz elman standardized rotated component plotted figure plots examined determine role played contribution network hidden units play major role component distinguishing positive negative outputs application comparative arithmetic arithmetic comparison requires conclude product integers greater equal comparison integer psychological simulations neural nets make additive multiplicative comparisons enhanced interest type problem mcclelland shultz schmidt buckingham press input unit coded type arithmetic operation performed addition multiplication additional linear input units encoded integers input units coded randomly selected integer range input unit coded randomly selected comparison integer addition problems comparison integers ranged multiplication comparison integers ranged sigmoid output units coded results comparison operation target outputs represented greater result targets represented targets represented equal thomas shultz component component component figure rotated component scores continuous component scores input pairs quadrant labeled black circles quadrant white squares quadrant white circles quadrant black squares networks task distinguish pairs quadrants black shapes pairs quadrants white shapes white shapes black densely black shapes high cube component loading figure standardized rotated component continuous rotated standardized dividing standard deviation respective contribution scores analysis contributions cross connected networks training patterns addition multiplication problems randomly selected restriction correct answers greater correct answers correct answers equal constraints designed reduce natural skew comparative values high direction multiplication problems nets epochs point close training patterns hidden units generalization previously unseen test problems accurate components sufficient account variance contributions case figure 
displays rotated component scores components component accounting variance separated problems greater answers problems answers located problems equal answers middle addition problems component variance separated multiplication addition contributions input unit strongly component similar results obtained nets components variance sensitive variation inputs supported examination input values extreme component scores components recall inputs coded integers added multiplied negative component input positive component input component input negative positive contrast correlation matrix nets yielded picture largest components focusing input variation lesser components bits pieces separation answer types operations interactive manner shultz problems equal answers isolated components scaled contributions produced components separated answer types operations failed represent variation input integers shultz elman essentially similar advantages matrix found nets learning addition multiplication application tile twospirals problem twospirals problem requires difficult discrimination large number hidden units input space defined spirals origin times sets realvalued pairs representing spirals single sigmoid output unit coded identity spiral nets epochs master distinction hidden units nets generalized previously unseen input pairs paths spirals matrix revealed components accounted total variance contributions fourth components distinguished spiral variance rotated component scores components plotted figure diagonal line drawn figure coordinates points spiral misclassified components data points training fact learned training patterns implies exceptions picked components components variance sensitive variation inputs confirmed input values extreme component scores components component negative positive thomas shultz component component figure rotated component scores arithmetic comparison greater problems circles problems squares addition white shapes multiplication black shapes 
equal problems addition represented multiplication densely white shapes black overlap black shapes black squares concentrated coordinates component spiral spiral component figure rotated component scores twospirals squares represent data points spiral circles represent data points spiral analysis contributions cross connected component negative positive means indicative perfectly symmetrical representations cascadecorrelation nets achieve highly symmetrical problem data point component mirror image negative opposite signed component score component mirror image point spiral components concentrated regions spirals nets yielded essentially similar results results previous analyses twospirals problem succeeded showing clear separation spirals based scaled shultz elman shultz correlation matrices showed extensive symmetries distinction spiral clear nets encoded problems inherent symmetries unclear previous work nets information distinguish points spiral points spiral discussion problems considerable variation network solutions revealed variation numbers hidden units signs sizes connection weights spite variation present technique applying matrix contributions yielded results sufficiently abstract characterize nets learning problem knowledge representations produced analysis identify essential information trained utilize features training nature input space research earlier conclusions network contributions technique understanding network performance sanger including intractable multilevel cross connected nets shultz elman shultz current study point ways contribution yield equally valid results starting dimensional matrix output unit hidden unit input pattern focusing output unit time hidden unit time sanger preferable contributions dimensional matrix output weight input pattern efficient yields valid results characterize network small parts network scaling contributions sign output target sanger contributions contributions realistic network knowledge output targets 
feedforward phase produce interpretations nets knowledge representations shultz claim true terms sensitivity input dimensions operational distinctions adding multiplying plots component scores based contributions typically dense based scaled contributions revealing networks knowledge finally applying correlation matrix contributions makes sense apply matrix noted introduction correlation matrix effectively contributions identical means variances role network connection weights present results knowledge representations matrix connection weight information explicitly retained matrix differences marked difficult problems twospirals reveal nets distinguished spirals based contributions twospirals problem presented shultz clear thomas shultz matrices relative advantages matrix evident easier problems recent rapid progress study knowledge representations learned neural nets feedforward nets viewed function approximators relating inputs outputs analysis knowledge representations reveal inputs encoded transformed produce correct outputs network contributions light function approximations components emerging transformations inputs produce correct outputs helps identify nature required transformations progress expected combining matrix decomposition techniques constrained external information decompose multivariate data matrices applying analysis techniques emerging research understanding applying neural research component predict results experiments neural nets role hidden unit identified virtue association component predict unit function served component acknowledgments research supported natural sciences engineering research council canada references test number factors multivariate behavioral research fahlman lebiere cascadecorrelation learning architecture touretzky advances neural information processing systems mountain view morgan kaufmann principal component analysis berlin springer verlag lang witbrock learning spirals touretzky hinton sejnowski proceedings 
connectionist models summer school mountain view morgan kaufmann mcclelland parallel distributed processing implications cognition development morris parallel distributed processing implications psychology neurobiology oxford university press networks theories place connectionism cognitive science psychological science sanger contribution analysis technique assigning responsibilities hidden units connectionist networks connection science shultz elman analyzing cross connected networks cowan tesauro alspector advances neural information processing systems francisco morgan kaufmann shultz analysis contributions cross connected networks proceedings world congress neural networks hillsdale lawrence erlbaum shultz schmidt buckingham press modeling cognitive development generative connectionist algorithm simon developing cognitive competence approaches process modeling hillsdale erlbaum principal component analysis external information subjects variables
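The contribution analysis discussed above can be sketched in a few lines. This is a minimal illustration, not the authors' code. It assumes that a contribution is the product of a sending unit's activation and its connection weight into one output unit, that contributions are left unscaled by the output target, and that the "matrix" applied to them is the variance-covariance matrix, so that weight magnitudes are retained. The rotation step used for the plotted component scores is omitted. The function names `contribution_matrix` and `principal_components` are hypothetical.

```python
import numpy as np


def contribution_matrix(sending_activations, output_weights):
    """Return a (patterns x weights) matrix of contributions.

    sending_activations: array of shape (n_patterns, n_sending_units)
    output_weights: array of shape (n_sending_units,), weights into one output unit
    Each entry is activation * weight, with no scaling by the output target.
    """
    activations = np.asarray(sending_activations, dtype=float)
    weights = np.asarray(output_weights, dtype=float)
    return activations * weights  # broadcasts the weights across patterns


def principal_components(contributions, n_components):
    """PCA of the covariance matrix of the contributions (no rotation)."""
    centered = contributions - contributions.mean(axis=0)
    cov = np.cov(centered, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)        # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:n_components]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    scores = centered @ eigvecs                    # component scores per pattern
    explained = eigvals / np.trace(cov)            # share of total variance
    return scores, eigvecs, explained


# Toy usage with made-up activations and weights (illustrative values only).
rng = np.random.default_rng(0)
acts = rng.random((40, 4))
w = np.array([2.0, -1.0, 0.5, 3.0])
C = contribution_matrix(acts, w)
scores, loadings, explained = principal_components(C, 2)
```

Because the covariance matrix is used rather than the correlation matrix, columns with large weights dominate the leading components, which is the property the discussion attributes to retaining connection weight information.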
5 segmentation online handprinted words guyon henderson bell laboratories holmdel swiss institute technology abstract paper reports performance methods segmentation strings online handprinted capital characters input strings consist time ordered sequence coordinates methods designed work mode constraint spacing characters methods neural network recognition engine approaches segmentation differ method call inseg input tation combination heuristics identify tentative segmentation points method call outseg output segmentation relies cally trained recognition engine recognizing characters identifying relevant segmentation points introduction address problem writer independent recognition handprinted words english dictionary levels difficulty recognition handprinted words illustrated figure examples extracted databases table cases spaced characters segmenting characters independently recognition process yields poor recogni tion performance motivated explore segmentation techniques guyon henderson table databases training testing words letters long letter words constrained legal english words legal english words length word dictionary data training test approx database nature size size letters short words grid english words spaced connected figure examples styles found databases line thickness basic principle segmentation present recognizer tentative characters recognition scores ultimately determine string segmentation investigated segmentation methods differ definition tentative characters similar recognition data collection device trajectory information sequence coordinates regular time intervals preprocessing technique preserves information keeping sampled sequence feature vectors trajectory guyon recognizer time delay neural network tdnn lang hinton waibel guyon output class case outputs providing score capital letters alphabet critical step segmentation process postprocessing word hypotheses character recognition scores provided tdnn purpose conventional dynamic 
programming algorithms addition dictionary checks solution returns list simi legal words word hypotheses subject list chosen dynamic programming algorithms segmentation relies recognizer give confidence segmentation online handprinted words scores wrong tentative characters segmentation mistake trained valid characters perform poorly task training techniques training wrong tentative characters produced segmentation engine negative exam ples additional training reduced error rates factor section describe inseg method tentative characters heuristic segmentation points expected handprinted capital letters writers separate letters method inspired similar technique optical character recog nition burges section present alternative method outseg expects recognition engine learn empirically learning examples recognize characters identify relevant segmentation points method bears similarities methods proposed keeler section compare methods present experimental results segmentation input space figure shows steps inseg process module define tentative characters tentative cuts spaces tentative characters module performs preprocessing scoring characters tdnn recognition results gathered interpretation graph module path graph found viterbi algorithm stroke detector tentative characters path search graph figure processing steps inseg method guyon henderson figure show simplified representation interpretation graph built system tentative character denoted double index tentative character starting point tentative character point denote node score letter tentative character path graph starts node ends node word starting point transitions kind allowed prevent character overlapping avoid searching complex graph perform pruning spatial relationship strokes discard tentative cuts instance strokes large horizontal overlap remaining tentative characters grouped ways form alternative tentative characters tentative characters separated large horizontal spatial interval considered grouping 
figure graph obtained input segmentation method grey shading recognition scores darker stronger recognition score higher recognition confidence table present results obtained tdnn recognizer guyon convolutional layers weights characters preprocessed individually network fixed dimension input segmentation output space contrast inseg outseg method rely human designed segmentation hints neural network learns recognition segmentation features examples segmentation online handprinted words tentative characters produced simply window input sequence small steps step content window tentative character successive characters overlap considerably time figure tdnn outputs outseg system grey curve path graph duration modeling word loop correctly recognized spite prevent segmentation basis figure show outputs tdnn recognizer word loop processed main matrix simplified representation tation graph tentative character numbers time direction column scores interpretations tentative character bottom line interpretation score approximates probability present input character meaningless character connections nodes reflect model character durations simple enforcing duration transitions stands letter character interpretation guyon henderson interpretation immediately character interpretation separated permits distinguishing letter duration letter repetition double path graph found viterbi algorithm fact simple pattern connections corresponds markov model ration exponential decay implemented slightly model generation duration distribution prevent character insertion experiments selected poisson distributions model character duration tdnn recognizer layers weights sequence recognition scores obtained sweeping neural network input convolutional structure tdnn identical successive calls recognizer sixth network connections tentative character consequence outseg system processes tentative characters inseg system computation time comparison results conclusions table comparison performance 
segmentation methods tdnn error dictionary error dictionary char word char word inseg outseg char word char word inseg outseg summarize table results obtained segmentation methods complement results obtained database database control words length english dictionary current versions inseg performs outseg outseg method handle connected letters word loop figure inseg method relies discovered people separate characters data collected hand advantage inseg method easily recognizers tdnn outseg method relies heavily convolutional structure tdnn computational efficiency comparison substituted neural network recognizers tdnn networks alternative input representations designed optical character recognition pixel inputs segmentation online handprinted words layer performs local line orientation detection orientation architecture similar layer removed local line orientation information directly extracted trajectory transmitted layer error rate tdnn orientation performs similarly dictionary orientation lower error rate tdnn improvement attributed recognition choices facilitates dictionary results date tables obtained inseg method recognizers combined voting scheme tdnn orientation comparison purposes mention results obtained commercial recognizer data notice dictionary data drawn larger dictionary commercial system results substantially commercial system absolute scale satisfactory account test data cleaned errors identified patterns written cursive totally expect outseg method work cursive handwriting exhibit trivial segmentation hints direct evidence support expectation rumelhart success version outseg work progress extend capabilities systems cursive writing table performance system comparison mention performances obtained commercial recognizer data performance commercial system dictionary marked penalized include words contained dictionary error dictionary error dictionary method char word char word acknowledgments entire neural network group bell labs holmdel discussions 
helpful suggestions paper jackel boser gratefully acknowledged grateful weiss yann giving neural networks inseg method indebted howard page providing comparison figures commercial recognizer experiments performed neural network boser bottou advice guyon henderson references guyon albrecht denker hubbard design neural network character recognizer touch terminal pattern recognition guyon henderson recognition based segmentation online handprinted words input output segmentation submitted pattern recognition october lang hinton time delay neural network architecture speech recognition technical report carnegiemellon university pitts waibel hanazawa hinton shikano lang phoneme recognition timedelay neural networks ieee transactions acoustics speech signal processing march burges denker jackel shortest path segmentation method training neural networks recognize character strings volume baltimore ieee burges denker recognition space neural network moody editor advances neural information processing systems denver morgan kaufmann keeler rumelhart integrated segmentation recognition handprinted numerals lippmann editor advances neural information processing systems pages denver morgan kaufmann jackel boser denker graf guyon henderson howard hubbard handwritten digit recognition application neural network chips automatic learning ieee communications magazine pages november private communication rumelhart integrated segmentation recognition cursive handwriting symposium computational learning cognition princeton jersey
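The path search over the INSEG interpretation graph can be illustrated with a small dynamic-programming sketch. It is a hedged reconstruction, not the authors' implementation. It assumes tentative characters are indexed by a (start cut, end cut) pair, that a character ending at cut j may only be followed by one starting at j, and that path scores combine by summing log recognition scores. The data layout, the function name `best_segmentation` and the scores below are hypothetical, and no dictionary check is applied.

```python
import math


def best_segmentation(scores, last_cut):
    """Best-scoring path from cut 0 to cut `last_cut`.

    scores maps (start_cut, end_cut) -> {letter: recognizer score in (0, 1]}.
    A tentative character (i, j) may only be followed by one starting at j,
    so characters on a path never overlap.
    """
    best = [-math.inf] * (last_cut + 1)   # best log-score reaching each cut
    back = [None] * (last_cut + 1)        # (previous cut, letter) on that path
    best[0] = 0.0
    for j in range(1, last_cut + 1):
        for i in range(j):
            letters = scores.get((i, j))
            if not letters or best[i] == -math.inf:
                continue
            # Keep only the top-scoring letter for this tentative character.
            letter, p = max(letters.items(), key=lambda kv: kv[1])
            if p <= 0.0:
                continue
            total = best[i] + math.log(p)
            if total > best[j]:
                best[j], back[j] = total, (i, letter)
    if back[last_cut] is None:
        return None, -math.inf            # no path covers the whole input
    word, j = [], last_cut
    while j > 0:
        i, letter = back[j]
        word.append(letter)
        j = i
    return "".join(reversed(word)), best[last_cut]


# Made-up recognition scores for four tentative cuts after the start node.
example_scores = {
    (0, 1): {"L": 0.9},
    (1, 2): {"O": 0.8, "C": 0.1},
    (0, 2): {"D": 0.3},
    (2, 3): {"O": 0.7},
    (3, 4): {"P": 0.9},
    (2, 4): {"B": 0.2},
}
word, log_score = best_segmentation(example_scores, 4)
```

Pruning of tentative cuts, as described for strokes with large horizontal overlap, would simply remove entries from `scores` before the search.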
12 model selection support vector machines vapnik research labs bank paris france research abstract functionals parameter model selection support vector introduced based concepts span support vectors rescaling feature space shown func predict choice parameters model relative quality performance parameter introduction support vector machines svms implement idea input vectors high dimensional feature space maximal margin hyperplane constructed shown training data separable error rate svms characterized radius smallest sphere training data distance hyperplane closest training vector feature space functional estimates dimension hyperplanes separating data margin perform mapping technique positive definite kernel specifies product feature space kernel radial basis function kernel free parameter generally kernels require parameters treating noisy data svms parameter penalizing training errors problem choosing values parameters minimize expectation test error called model selection problem shown parameter kernel minimizes functional good choice model minimum functional coincides minimum test error shapes curves article introduce refined functionals choice parameters parameter kernel parameter penalizing training error produce curves reflect actual error rate model selection support vector machines paper organized section describes basics svms section introduces functional based concept span support vectors section considers idea rescaling data feature space section discusses experiments model selection functionals support vector learning introduce standard notation svms complete description training examples belong class labeled decision function coefficients obtained maximizing functional constraints constant controls tradeoff complexity decision function number training examples misclassified linear maximal margin clas highdimensional feature space data mapped nonlinear function points called support vectors distinguish call support vectors category prediction span
support vectors results introduced section based leaveoneout crossvalidation esti mate procedure estimate probability test error learning algorithm leaveoneout procedure leaveoneout procedure consists removing training data element decision rule basis remaining training data testing removed element fashion tests elements training data ferent decision rules denote number errors leaveoneout procedure leaveoneout procedure unbiased estimate probability test error expectation test error machine trained examples equal expectation provide analysis number errors made leaveoneout procedure purpose introduce concept called span support vectors vapnik span support vectors results presented section depend feature space loss generality linear svms suppose solution optimization problem fixed support vector define constrained linear combinations support vectors category note define quantity call span support vector minimum distance figure figure support vectors dashed line shown empty diameter smallest sphere support vectors intuitively smaller leaveoneout procedure make error vector formally theorem holds theorem leaveoneout procedure support vector recognized incorrectly inequality holds theorem implies separable case number errors made leaveoneout procedure bounded improvement compared functional depending support vectors span diameter support vectors equal assumption support vectors change leaveoneout procedure leads theorem model selection support vector machines theorem sets support vectors categories remain leaveoneout procedure support vector equality holds decision function trained training point removed proof theorem theorem assumption support vectors change leaveoneout procedure satisfied cases proportion points violate assumption small compared number support tors case theorem good approximation result procedure pointed experiments section figure noticed larger important decision function support vector surprising removing point change decision function proportional 
lagrange multiplier kind result theorem derived svms threshold inequality derived span takes account geometry support vectors order precise notion important point previous theorem enables compute number errors made procedure corollary assumption theorem test error prediction leaveoneout procedure note points support vectors correctly classified leaveoneout procedure defines number errors leaveoneout procedure entire training assumption theorem constraints definition removed hyperplanes passing origin constraint removed assumptions putation span unconstrained minimization quadratic form analytically support vectors category leads closed form matrix products support vectors category similar result obtained section model selection separable separable cases rescaling mentioned functional bounds dimension linear margin clas bound tight data surface sphere training data data flat bound poor radius sphere takes account components largest deviations idea present make rescaling data feature space radius sphere stays constant margin increases apply bound rescaled data hyperplane vapnik linear svms mapping high dimensional space rescaling achieved computing covariance matrix data rescaling eigenvalues suppose data centered normalized eigenvectors covariance matrix data compute smallest data centered origin edges approximation smallest length edge direction rescaling consists diagonal transformation decision function changed transformation data fill side length functional rescaled data estimated radius ball classical theoretical works justify change norm nonlinear case note data high dimensional feature space linear subspace spanned data number training data large work subspace dimension purpose tools kernel matrix normalized eigenvectors gram matrix eigenvalues product replaced achieve diagonal transformation finally functional experiments check methods performed series experiments concerns choice width kernel linearly separable database postal database dataset consists 
handwritten digit size test examples split training subsets training examples task consists separating digit error bars figures standard deviations trials experiment choose optimal noisy database database dataset split randomly times training examples test examples section describes experiments model selection separable case nonseparable section shows bounds model selection separable case rescaling model selection section prediction test error derived model selection figure shows test error prediction span differ values width kernel postal database figure plots functions values database method predicts correct minimum prediction accurate curves identical horn model selection support vector machines test span sigma choice postal database span choice database figure test error prediction computation involves computing span support vector note interested inequality exact span minimizing find point stop minimization point correctly classified leaveoneout procedure turned experiments time required compute span training time extension application span concept denote hyperparameter kernel derivative computable compute analytically derivative upper bound number errors made leaveoneout procedure theorem powerful technique model selection initial approach choose width kernel minimum case values hyperparameters component exhaustive search values hyperparameters previous remark enables find optimal classical gradient descent approach preliminary results show approach previously mentioned kernel improve test error dimension rescaling section perform model selection postal database functional rescaled version figure shows values classical bound values bound predicts correct minimum reflect actual test error easily large values data input space tend mapped flat feature space fact account figure shows performing rescaling data manage tighter bound curve reflects actual test error figure vapnik sigma rescaling dimension sigma rescaling figure bound dimension values postal database 
shape curve rescaling similar test error figure conclusion paper introduced techniques model selection svms based span based rescaling data feature space demonstrated techniques predict optimal values parameters model evaluate relative performances values parameters functionals lead learning techniques establish generalization ability margin acknowledgments authors haffner discussions comments references burges tutorial support vector machines pattern recognition data mining knowledge discovery jaakkola haussler probabilistic kernel regression models proceedings conference statistics opper winther gaussian process classification field results leaveoneout estimator advances large margin classifiers press shawetaylor smola williamson support vector error bounds ninth international conference artificial neural networks smola kernel principal component analysis neural networks pages berlin springer lecture notes computer science vapnik statistical learning theory wiley york vapnik bounds error expectation neural computation submitted
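The span-based prediction of the leave-one-out error can be sketched numerically. The closed form and the counting rule are not legible in this text, so the sketch below relies on two assumptions taken from the standard statement of the span result: with the constraints on the combination coefficients removed and all support vectors of the first category, the squared span is S_p^2 = 1 / (K~^{-1})_pp, where K~ is the support-vector kernel matrix bordered by a row and column of ones with a zero corner; and a support vector is counted as a leave-one-out error when alpha_p S_p^2 >= y_p f(x_p). The function names are hypothetical, and the SVM solution (multipliers and outputs) is taken as given.

```python
import numpy as np


def span_squared(gram_sv):
    """Squared span S_p^2 of each support vector, unconstrained case.

    gram_sv: (n_sv, n_sv) kernel matrix K(x_i, x_j) over the support vectors.
    Uses S_p^2 = 1 / (inverse of the bordered matrix [[K, 1], [1^T, 0]])_pp.
    """
    K = np.asarray(gram_sv, dtype=float)
    n = K.shape[0]
    bordered = np.zeros((n + 1, n + 1))
    bordered[:n, :n] = K
    bordered[:n, n] = 1.0
    bordered[n, :n] = 1.0
    inverse = np.linalg.inv(bordered)
    return 1.0 / np.diag(inverse)[:n]


def loo_error_estimate(alphas, margins, spans_sq, n_train):
    """Fraction of training points predicted to fail leave-one-out.

    alphas: Lagrange multipliers of the support vectors
    margins: y_p * f(x_p) for the same support vectors, full-data decision function
    A support vector is counted as an error when alpha_p * S_p^2 >= y_p * f(x_p);
    points that are not support vectors are never counted.
    """
    alphas = np.asarray(alphas, dtype=float)
    margins = np.asarray(margins, dtype=float)
    errors = np.sum(alphas * spans_sq >= margins)
    return errors / n_train


# Geometric check with a linear kernel: the span of each point is its
# distance to the line through the other two points.
X = np.array([[0.0, 0.0], [2.0, 0.0], [1.0, 1.0]])
spans_sq = span_squared(X @ X.T)
estimate = loo_error_estimate([0.1, 0.1, 2.0], [1.0, 1.0, 1.0], spans_sq, n_train=10)
```

For model selection, this estimate would be recomputed for each candidate kernel width and the minimizing value retained, which avoids retraining once per training point.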
9 responses derived centersurround inputs surprising power computation bartlett daniel ruderman department biomedical engineering university southern california angeles neuroscience program university southern california angeles abstract biophysical modeling studies previously shown cortical pyramidal cells driven strong synaptic currents andor dendritic voltagedependent chan respond strongly synapses activated spatially clustered groups optimal size comparison number synapses activated dendritic arbor nonlinear interactions giving rise cluster sensitivity property layer virtual linear hidden units dendrites implications basis learning memory classes nonlinear sensory processing present study show single neuron access excitatory inputs offcenter cells exhibits principal nonlinear response properties complex cell primary visual cortex orientation tuning coupled translation ance contrast conjecture type processing explain complex cell responses absence oriented simple cell input ruderman introduction simple complex cells visual cortex hubel wiesel simple cell receptive fields subdivided spatial summation cells type modeled linear filters thresh nonlinearity contrast complex cell receptive fields subdivided distinct group exhibit number fundamentally nonlinear behaviors including orientation tuning receptive field wider optimal larger responses thin bars thick direct violation superposition principle sensitivity light dark bars receptive field traditional model complex cell responses involves hierarchy consisting centersurround inputs drive simple cells turn provide oriented input complex cell pooling simple cells positions phases complex cell respond selectively stimulus orientation generalizing stimulus position contrast pure hierarchy involving simple cells variety recent experimental results indicating complex cells receive monosynaptic cells depend simple cell input remains unknown complex cell responses derive intracortical network depend simple cells originate 
directly intracellular computations previous biophysical modeling studies inputoutput function dendritic tree excitatory voltagedependent membrane mechanisms loworder polynomial function products review close match type computation energy models complex cells suggested origin complex cell responses present study tested hypothesis local nonlinear processing dendritic tree single neuron receives excitatory synaptic input unoriented centersurround cells generate nonlin complex cell response properties including orientation selectivity coupled position contrast invariance methods biophysical modeling simulations layer pyramidal cell visual cortex carried neuron biophysical parameters implementation details andor shown table dendritic modeled soma contained modified hodgkinhuxley channels peak somatic conduc dendritic membrane electrically passive synapse included nmda simulation environment michael john moore synaptic channel implementations alan mainen responses derived centersurround inputs figure layer pyramidal neuron simulations showing synaptic contacts morphology rodney douglas martin excitatory conductances table conductances scaled estimate local input resistance local epsp size approximately uniform dendritic tree inhibitory synapses modeled mapping visual stimuli dendritic tree stimulus image consisted pixel array light dark pixel background bars length width presented orientations positions image images linearly filtered receptive fields center width surround width response filtered images mapped arrays oncenter offcenter cells outputs thresholded crude model gain control random subset neurons remained active drive modeled cortical cell neuron gave rise single synapse cortical cells dendritic tree excitatory synapses originating active cells activated asynchronously synapses remained silent spatial arrangement connections cells pyramidal cell dendrites generated automatically pairs cells active presentations optimally oriented bars formed synapses nearby 
sites dendritic tree activity cell array optimally oriented shown frequently pairs neurons referred geometric arrangement shown correlationbased clustering achieved choosing random cell placing dendritic site randomly ruderman parameter somatic somatic synapse count stimulus frequency figure table simulation parameters choosing placing dendritic site cells case cell chosen random restart sequence cells chosen meaning synapses successfully mapped dendritic tree previous modeling work shown type clustering correlated inputs dendrites natural outcome balance synapse formation activity dependent synapse stabilization method guaranteed optimally oriented stimulus activated larger number average bars orientations turn distributions activated synapses dendrites response optimal orientations comparison orientations previous work shown synapses activated clusters dendritic arbor produce significantly larger cell responses number synapses activated dendritic tree results results series runs shown stimulus average spike rate measured period beginning spike initiated stimulus onset measure initial transient resting potential provided rough steadystate measure stimulus effectiveness spike rates runs averaged input condition orientation tuning curves thin pixels shown orientation tuning peaks sharply vertical decays slowly larger angles tuning apparent dark light bars remains independent location receptive field responses derived centersurround inputs active active inactive figure cell activities response vertical light width presented dark background large white circles active oncenter cells large dark circles active offcenter cells small gray circles inactive cells section array shown discussion results pyramidal cell driven exclusively excitatory inputs offcenter cells biophysical level capable nonlinear response property visual complex cells cells preference light dark vertical bars established manipulating spatial arrangement connections cells pyramidal cell dendrites 
synapses activated tested condition significantly larger responses optimal orientations explained simple elevation total synaptic activity neuron condition origin cells orientationselective response resulted nonlinear pooling large number subunits consisting pairs cells optimally oriented achieved similar results experiments variety structures including complex arrays substantially degrees receptive field overlap random subsampling array graded activity levels dendritic trees active sodium channels addition nmda channels attempted relate orientation width tuning curves detailed aspects complex cell physiology model cell interested establishing salient nonlinear features complex cell physiology feasible single cell level detailed comparisons results empirical tuning curves made model cell network exists absent normal recurrent excitatory inhibitory influences cortical network ruderman figure layout oncenter cell vertically oriented thin bars linear ideal vertical width determined suppose cell chosen random center cell location cell array vertical presented cells vertical edges active oncenter cell position active light column cells inside edge cells oncenter cells vertical column oncenter cells vertical columns distance left depending position offcenter cells columns distance edge adjacent offcenter cells distance left opposite edge cells distance neighboring columns included offcenter cell shown bottom figure optimally stimulated bars width shown width selected width bars presented stimuli experimental validation simulation results imply significant change role single neuron neocortical processing acknowledgments work funded grants national science foundation office naval research references freeman ohzawa local intracortical connections cats plasticity neurophysiol heeger normalization cell responses striate cortex neurosci visual stone conduction velocity afferents visual cortex correlation cortical receptive field properties brain hubel wiesel receptive fields 
binocular interaction functional architecture cats visual cortex physiol responses derived centersurround inputs orientation degrees figure orientation tuning curves model neurons light bars receptive field light bars displaced pixels horizontally squares dark bars centered receptive field dark bars displaced pixels standard errors data area pattern thalamic control cortical layers clusteron simple abstraction complex moody hanson lippmann editors advances neural information processing systems pages morgan kaufmann mateo pattern discrimination modeled cortical neuron neural computation synaptic integration excitable dendritic tree neurophysiol information processing dendritic trees neural computation movshon velocity tuning single units striate cortex physiol lond ohzawa freeman depth tion visual cortex neurons ideally suited disparity detectors science visual cortical neurons localized spatial frequency filters ieee trans cybern wilson perception form retina striate cortex editors visual perception neurophysiological foundations pages academic press diego
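The stimulus mapping described in the methods (bar image, linear center-surround filtering, thresholded on-center and off-center arrays) can be sketched as follows. This is an illustration under stated assumptions, not the simulation code: the receptive field is modeled as a difference of Gaussians, the image size, kernel size, widths and threshold are hypothetical because the original values are not legible here, and the random subsampling of active cells and the dendritic mapping are left out.

```python
import numpy as np


def dog_kernel(size, sigma_center, sigma_surround):
    """Difference-of-Gaussians kernel with zero net weight (size must be odd)."""
    axis = np.arange(size) - size // 2
    xx, yy = np.meshgrid(axis, axis)
    r2 = xx ** 2 + yy ** 2
    center = np.exp(-r2 / (2.0 * sigma_center ** 2))
    surround = np.exp(-r2 / (2.0 * sigma_surround ** 2))
    return center / center.sum() - surround / surround.sum()


def filter_same(image, kernel):
    """Linear filtering with edge padding; output has the image's shape."""
    image = np.asarray(image, dtype=float)
    half = kernel.shape[0] // 2
    padded = np.pad(image, half, mode="edge")
    rows, cols = image.shape
    out = np.zeros((rows, cols))
    for dy in range(kernel.shape[0]):
        for dx in range(kernel.shape[1]):
            out += kernel[dy, dx] * padded[dy:dy + rows, dx:dx + cols]
    return out


def on_off_activity(image, kernel, threshold):
    """Boolean activity maps for the on-center and off-center cell arrays."""
    response = filter_same(image, kernel)
    on_active = response > threshold      # brighter center than surround
    off_active = -response > threshold    # darker center than surround
    return on_active, off_active


# A thin vertical light bar on a dark background (illustrative sizes).
image = np.zeros((21, 21))
image[:, 10] = 1.0
kernel = dog_kernel(9, sigma_center=1.0, sigma_surround=2.0)
on_active, off_active = on_off_activity(image, kernel, threshold=0.05)
```

With these settings the on-center cells along the bar are active and the off-center cells in the flanking columns are active, matching the layout described for a vertical light bar; inverting the image swaps the two populations.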
12 parallel problems server interactive tool large scale machine learning charles labs park avenue room park lawrence berkeley national road berkeley abstract imagine classify data consisting tens thousands examples twenty thousand dimensional space standard machine learning algorithms describe parallel problems server ppserver users computers work large data sets matlab work motivated desire bring benefits scientific computing algorithms computational power machine learning researchers demonstrate usefulness system number tasks perform independent components analysis large text corpora consisting tens thousands documents making minimal original bell sejnowski matlab source bell applying techniques data previously reach leads interesting analyses data algorithms introduction realworld data sets extremely large standards machine learning community text retrieval process collections consisting tens hundreds thousands documents easily words naturally apply machine learning techniques problem size data makes difficult paper describes parallel problems server ppserver ppserver linear algebra server executes distributed memory algorithms large data sets users large data sets matlab system brings efficiency power parallel computation researchers machines maintain benefits interactive demonstrate usefulness ppserver number tasks perform independent components analysis large text corpora consisting tens thousands documents minimal original bell sejnowski matlab source bell sejnowski applying techniques datasets previously matlab computational interface routines matlab server workers workers server variables figure ppserver matlab completely transparent ppserver variables tied ppserver matlab maintains handles data labs object system functions ppserver variables ppserver commands reach discover interesting analyses data algorithms parallel problems server parallel problems server ppserver foundation work ppserver realization model computation large matrices platform
supporting message passing interface library standard communication writing parallel code ppserver implements functions creating removing distributed matrices loading storing disk format performing elementary matrix operations matrices twodimensional single double precision arrays created ppserver functions provided matrix sections ppserver supports dense sparse matrices ppserver simple protocol requests action command arguments server executes command command complete ppserver directly called ppserver robust protocol communicating load remove execute commands package direct access information ppserver matrices package represents defining visible function names supports data users subset functions package loading defines function names finally support common parallel applying function element matrix making easier common functionality ppserver commands implemented including basic matrix operations realized functions include optimized version large scale machine learning parallel problems server function figure matlab code producing hilbert matrices influenced creates ppserver object matlab object directly communication interface plications functionality implemented interface matlab called collection matlab objects matlab language matlab programs external language transparent integration matlab front parallel problems server choice matlab influenced factors standard computing wide industry machine learning community algorithms written matlab scripts made freely scientific computing community algorithms matlab optimized languages make interaction ppserver transparent user principle typical matlab user make explicit calls ppserver current matlab programs rewritten advantage ppserver space permit complete discussion refer reader bands briefly discuss matlab scripts modification accomplished simple tion matlab object oriented features create ppserver objects automatically special object introduce matlab acts integer user typing obtains distributed parallel reader guess 
distributed rows columns user matrices matlab handles special distributed types exist ppserver references variables commands recognized call ppserver traditional matlab command figure shows code built function call produces hilbert matrix influenced parallel array results line creates ppserver vector places handle note behavior interfere semantics loops matlab assigns column numbers line produces ppserver matrix emulation indexing functions results correct execution line transpose operator executes line ppserver line generated ppserver elementary matrix operations makes ppserver matrix line parallel problems server tested extensively variety clusters symmetric clusters intel ppserver tested including common lisp computational performance varies depending platform clear system distinct computational advantages communication overhead experiments roughly milliseconds ppserver command negligible compared computational space advantage transparent access linear algebra algorithms applications text retrieval section demonstrate efficacy ppserver realworld machine learning problems explore ppserver text retrieval domain task text retrieval find subset collection documents relevant users information request standard approaches based vector space model document vector dimension count occurrence word collection documents matrix column document vector similarity documents product queries documents relevance documents query typical small collections thousand vectors thousand dimensional space large collections vectors hundreds thousands dimensions standard machine learning techniques exhibit predictable behavior circumstances simply scale approaches construct linear operators extract underlying topic structure documents documents queries projected smaller space compared product large matrix support enables matrix decomposition techniques extracting linear operators easily explored viola discuss standard algorithms demonstrate ppserver perform interesting analysis large datasets 
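The vector space model described above (one count vector per document, stored as a column of a term-by-document matrix, with relevance scored by an inner product against the query vector) can be sketched in a few lines. This is only an illustrative sketch in plain Python, not the PPServer or MATLAB code used in the paper; the function names and the toy documents are hypothetical.

```python
from collections import Counter


def build_term_document_matrix(documents):
    """Return (vocabulary, columns) with one word-count vector per document."""
    vocabulary = sorted({word for doc in documents for word in doc.split()})
    index = {word: i for i, word in enumerate(vocabulary)}
    columns = []
    for doc in documents:
        column = [0] * len(vocabulary)
        for word, count in Counter(doc.split()).items():
            column[index[word]] = count
        columns.append(column)
    return vocabulary, columns


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def score_documents(query, vocabulary, columns):
    """Score every document by the inner product of query and document vectors."""
    index = {word: i for i, word in enumerate(vocabulary)}
    q = [0] * len(vocabulary)
    for word in query.split():
        if word in index:  # words outside the collection vocabulary are ignored
            q[index[word]] += 1
    return [dot(q, column) for column in columns]


# Hypothetical toy collection, used only to exercise the functions.
docs = ["parallel matrix server", "text retrieval query text", "matrix matrix algebra"]
vocab, cols = build_term_document_matrix(docs)
scores = score_documents("matrix", vocab, cols)
ranking = sorted(range(len(docs)), key=lambda j: scores[j], reverse=True)
```

For realistic collections the matrix is large and sparse, which is the situation the text says motivates a distributed linear algebra server; the dense lists here are only for readability.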
latent semantic indexing latent semantic indexing constructs smaller document singular decomposition eigen vectors cooccurrence matrix diagonal elements referred singular values square roots eigenvalues eigenvectors largest eigenvalues capture axes largest variation data projects documents kdimensional subspace spanned columns denoted documents queries similarly projected scores obtained simple matlab code matlab execute matlab code parallel large scale machine learning parallel problems server figure singular values collection documents terms singular values half collection computation full collection minutes processors sparse compute computes sparse matrix scores returned combined relevance obtain curves displayed matlab addition evaluating performance techniques explore charac data implementations large collections subset documents computational reasons leads question affected figure shows singular values large collection random half collection shows shape curves remarkably similar half suggests derive projection matrix half collection evaluation technique easily performed system experiments show identical retrieval performance independent components documents independent components analysis sejnowski recovers linear data unlike finds principal components finds axes statistically independent success application blind source separation problem problem observes output number microphone assumed recording linear mixture number unknown sources task recover original sources natural embedding text retrieval framework words observed microphone signals underlying topics source signals give rise figure shows typical distribution words projected axes found words close histogram shows words large positive results collection white house press documents distinct words transition values figure distribution words large magnitude axis white house text negative values group words made terms group words directly related occurs individual words group occur times context south 
documents policy general acts discriminating word observed viola appears finding words selects related documents words elements select elements intuitively selects documents general subject area removes specific subset documents leaving small highly related documents suggests straightforward algorithm achieve goal directly local clustering approach similar unsupervised version query analysis similar collections reveals interesting behavior large datasets attempt find unmixing matrix full rank conflict notion collections smaller subspace found experiments axes highly produce distributions conjecture axis results distribution split arbitrarily empty axes purposes axes uninformative automatic noise reduction technique applied large datasets purposes comparison figure illustrates performance algorithms including clustering techniques articles wall street journal discussion shown enables highperformance interactive parallel problems server powerful mechanism writing optimized algorithms communication protocol makes transparent integration sufficiently powerful matlab tool researchers matlab algorithms working small problems makes operate visualize large data sets demonstrated claim ppserver system apply techniques large datasets allowing analyses data algorithms implement versions diverse viola gradient descent collection documents words figure comparison algorithms wall street journal references bell sejnowski approach blind source separation blind deconvolution neural computation walker users guide httpwww viola mimic finding optima estimating probability densities advances neural information processing systems landauer indexing latent semantic analysis journal society information science editors information retrieval data structures algorithms prenticehall parallel programming interface press tool interactive proceedings ninth siam conference parallel processing scientific computing viola sparse high dimensional
data effective retrieval advances neural information processing systems method weighting query terms retrieval proceedings conference pages framework multipleinstance learning advances neural information processing systems implementation distributed memory parallel computers preliminary proceedings mountain conference iterative methods information management tools updating indexing scheme technical report university scientific computation httpwww ppserver parallel problems server page httpwww saund applying multiple mixture model text categorization proceedings machine learning conference learning queries query zone proceedings international conference research development information retrieval
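The latent semantic indexing section of the paper above relates singular values of the term-by-document matrix to eigenvalues of the co-occurrence matrix: the singular values are the square roots of those eigenvalues. A minimal sketch of that relation follows, assuming a tiny dense matrix and plain power iteration for the leading eigenvector. Power iteration is swapped in here for brevity and is not the sparse solver the paper ran on PPServer; all names are hypothetical.

```python
import math


def cooccurrence(a):
    """Return A A^T for a term-by-document matrix given as a list of term rows."""
    return [[sum(x * y for x, y in zip(row_i, row_j)) for row_j in a] for row_i in a]


def leading_eigenpair(c, steps=200):
    """Power iteration: leading eigenvalue and unit eigenvector of a symmetric matrix."""
    n = len(c)
    v = [1.0 / math.sqrt(n)] * n
    for _ in range(steps):
        w = [sum(c[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Rayleigh quotient v^T C v for the converged unit vector.
    eigenvalue = sum(v[i] * sum(c[i][j] * v[j] for j in range(n)) for i in range(n))
    return eigenvalue, v


def project(column, axis):
    """Coordinate of one document vector along a single projection axis."""
    return sum(a * b for a, b in zip(column, axis))


# Hypothetical 2-term by 3-document matrix; rows are terms, columns are documents.
A = [[3.0, 0.0, 0.0],
     [0.0, 1.0, 0.0]]
eigenvalue, axis = leading_eigenpair(cooccurrence(A))
singular_value = math.sqrt(eigenvalue)
first_document = [row[0] for row in A]
coordinate = project(first_document, axis)
```

Keeping the k leading axes instead of one gives the k-dimensional subspace into which documents and queries are projected before scoring.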
9 dynamics training information representation japan manfred opper theoretical physics university germany abstract method calculate full training process neural network introduced sophisticated methods replica trick results directly related actual number training steps results presented maximal learning rate exact description early stopping number training steps problems addressed approach introduction training guided empirical risk minimization minimize risk phenomenon called overfitting major problems neural network learning previous work developed approx description training process statistical mechanics solve problem introduce description directly dependent actual training steps result analytical curves empirical risk expected risk functions training time shown make method tractable restrict simple neural network model demonstrates typical behavior neural nets model single layer perceptron layer adjustable weights input output outputs linear interested supervised learning examples correct output define task monitor training process assume examples provided network called teacher network teacher restricted linear outputs nonlinear output function email opper learning examples attempts minimize error averaged examples called training error empirical risk fact interested minimal error averaged inputs called generalization error expected risk shown random inputs components independent means unit variance generalization error order parameters order parameters defined novelty paper average order parameters usual statistical mechanics realizations teacher realizations spherical distribution corresponds bayesian average unknown teacher study static properties model saad comments averages found appendix section introduce method briefly readers technical details reading turn directly results remainder section read proof section results presented discussed finally conclude paper summary perspective problems dynamical approach basically exploit gradient descent learning rule
linear student weights linear combinations inputs algebra recursion found term round brackets defines overlap matrix geometric series solution recursion weights dynamics training hebbian initial conditions yields infinite time steps called pseudoinverse weights valid long examples linearly independent remarks case follow expression calculate behavior order parameters training process average expression appendix similarly order parameter applied identity appendix matrix algebra note point order parameters calculated assumption statistics inputs results hold thermodynamic limit trace calculated integration eigenvalues attain integrals form integrals calculated density eigenvalues determination density found recent literature calculated opper replicas krogh perturbation theory sollich matrix identities note thermodynamic limit special assumptions inputs enter calculation authors found opper maximal minimal eigenvalues remains numerical integration similarly calculate behavior training error case find recursion analog term round brackets defines matrix calculation similar matrix playing role matrix density eigenvalues matrix multiplied altogether find results case case timedependent integrals vanish remaining terms describe limit optimal convergence rate errors section discuss implications result results illustrate theoretical results describe training process compare theory simulations find good correspondence values learning rate maximal learning rate inverse maximal eigenvalue matrix consistent general result maximal learning inverse maximal eigenvalue hessian case linear perceptron matrix identical hessian approach directly related actual number training steps examine training time varies training scenarios training stopped training error reaches minimal crossvalidated early stopping terminate training generalization error starts increase dynamics training training steps figure behavior generalization error upper line training error lower line training process 
loading rate storage capacity overfitting occurs theory describes results simulations parameters learning rate system size gain shows exhaustive training training time diverges region overfitting occurs region early stopping shows slight increase training time guess asymptotically training steps fulfill stopping criteria precisely study behavior training step interested limit examples choose learning rate fraction maximal learning rate calculate behavior analytically find case generalization error reach asymptotic minimum rate convergence optimal case find large neglect term batch training steps optimal convergence rate results illustrated summary paper calculated behavior learning training error training process approach relates errors directly actual number training steps shown good theory describes training process results presented maximal learning rate training time scenarios early stopping learning rate chosen appropriately batch training steps reach optimal convergence rate sufficiently large opper early stop figure number training steps fulfill stopping criteria upper lines show result training stopped training error lower dotted line dashed line solid line early stopping result training stopped generalization error started increase simulation results marks parameters learning rate system size gain problems dynamical description weight decay relation dynamical approach thermodynamic description training process discussed lack space problems examined extended version work opper interesting method extended realistic models appendix identities averages teacher weight distributions statistical mechanics approach assumes distribution local fields gaussian true averages random inputs moments usual approach principle average tasks teacher realizations gaussian local fields fulfill implies identity calculated diagonal term term made expansion assuming small correlations similarly dynamics training figure behavior training steps results large training steps reach 
optimal convergence solid line optimal result reached faster parameters learning rate gain identity proved acknowledgment amari discussions valuable comments references avoiding overfitting finite temperature learning cross validation conference artificial neural networks edited opper exact description early stopping weight decay submitted opper dynamics learning models neural networks edited domany hemmen schulten springer krogh learning noise linear perceptron phys opper learning neural networks solvable dynamics europhys lett saad general gaussian priors improved generalization submitted neural networks sollich learning large linear perceptrons thermodynamic limit relevant real world nips
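The paper above ties the maximal learning rate of batch gradient descent in a linear perceptron to the largest eigenvalue of the Hessian of the training error. As a hedged illustration, the sketch below uses the standard result for a quadratic error: the iteration converges only when the learning rate is below twice the inverse of the largest Hessian eigenvalue. The two-example data set, the function names, and the zero initial weights are assumptions chosen so that the Hessian is diagonal with eigenvalues 0.5 and 2, which puts the threshold at a learning rate of 1.

```python
def training_error(w, inputs, targets):
    """Mean squared training error E = 1/(2P) * sum of squared residuals."""
    p = len(inputs)
    residuals = [sum(wi * xi for wi, xi in zip(w, x)) - y for x, y in zip(inputs, targets)]
    return sum(r * r for r in residuals) / (2 * p)


def gradient_step(w, inputs, targets, rate):
    """One batch gradient descent step on the training error of a linear student."""
    p = len(inputs)
    grad = [0.0] * len(w)
    for x, y in zip(inputs, targets):
        residual = sum(wi * xi for wi, xi in zip(w, x)) - y
        for i, xi in enumerate(x):
            grad[i] += residual * xi / p
    return [wi - rate * gi for wi, gi in zip(w, grad)]


def train(inputs, targets, rate, steps):
    """Return the training error after 0, 1, ..., steps batch updates."""
    w = [0.0] * len(inputs[0])  # assumption: zero initial weights
    errors = [training_error(w, inputs, targets)]
    for _ in range(steps):
        w = gradient_step(w, inputs, targets, rate)
        errors.append(training_error(w, inputs, targets))
    return errors


# Hypothetical data: the Hessian (1/P) * sum of x x^T equals diag(0.5, 2.0).
inputs = [[1.0, 0.0], [0.0, 2.0]]
targets = [1.0, -2.0]  # produced by a linear teacher with weights (1, -1)
stable_errors = train(inputs, targets, rate=0.9, steps=50)    # below 2 / 2.0
unstable_errors = train(inputs, targets, rate=1.1, steps=50)  # above 2 / 2.0
```

Because each list records the error after every batch step, it can be read directly as a function of the actual number of training steps, which is the quantity the paper's analysis is expressed in.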
3 simultaneous classification applied speech recognition john bridle royal signals radar great abstract stephen british research labs form neural network terminals acoustic patterns class labels speaker parameters method training network tune speaker parameters speaker outlined based trick converting supervised network unsupervised mode describe experiments approach isolated word recognition based hidden markov models results improvement speakerindependent performance unlabelled data performance close achieved labelled data introduction concerned emulate aspects perception stimulus ambiguous unknown lighting conditions unambiguous context stimuli fact subject unknown conditions perceptual apparatus constraints solve problem individual words ambiguous human instance word sound standard english speakers similarly room work walk pairs british english heard ambiguous knowing speaker word current automatic speech recognition systems effects frequent concentrate important aspects signal locally exploit fact unknown properties apply words bring bear task acoustic disambiguation information latent context utterance attempts construct systems persons socalled speakerindependent models decoding short sequence words imposing knowledge speech person enable adaptation small amounts speech speaker propose factor speech knowledge speakerindependent models speaker specific parameters transformation modifies models speaker parameters paper transformations easily applied input patterns interested possibility estimating parameters small amounts unlabelled speech short words longer word types models transformations simple hope general approach applicable sophisticated models transformations future highperformance speech recognition systems adaptive network approach suppose feedforward network vectorvalued knowledge relationship acoustic patterns class labels word identities speaker parameters training network difficult supply pairs values names speakers descriptive labels training
network start default values feed forward backpropagate derivatives internal parameters network weights transition probabilities enforcing constraint speaker stay equal imagine copy network utterance terminals networks dealing speaker convenient implementation small number training speakers adapt vector speaker weights coded speaker identity inputs linear units network trained modes utterances speaker speaker training inputs adjusted case interest paper unknown words unknown speaker networks word values propagate produce distributions word labels technique distributions simplest case process matter utterance pick word label largest output assuming correct backpropagate derivatives common practice method large outputs encouragement bridle networks show target procedure lead hillclimbing likelihood data assumption form generator data appendix simple network illustration explored ideas simple network based figure viewed feedforward network radial minus euclidean distance squared units softmax output nonlinearity gaussian classifier covariance matrices unit diagonal training gradientbased optimisation backpropagation partial derivatives training criterion based relative entropy likelihood targets network outputs discriminative training lead results usual modelbased methods case reference points data means class simple classifier network preceded full linear transformation parameters equivalent modelbased classifier gaussian distributions arbitrary covariance matrix class linear units speaker parameters weights speaker identity inputs straight hidden units figure adaptation speaker unlabelled tokens speaker parameters transformation allowed adapt targets derived outputs targets double outputs largest outputs encouraged figure adaptation positions reference points radial units figure input points essentially reference points displaced side represent word spoken speaker adaptation based tentative classifications reference points position inputs confident consistent
labels speech recognition experiments applied ideas problem short words spoken unknown speaker method works word average unknown words speaker dataset recorded previously purposes british english isolated names letters alphabet spoken times speaker speakers divided groups train test balanced initial acoustic analysis produced spectrum vectors place input patterns discussed speech pattern sequence typically place simple gaussian density gaussian densities matrix probabilities transitions class model hidden markov model word hmms softmax normalised exponential normalised exponential units linear inputs speaker inputs linear transformation feedforward network implementing simple gaussian classifier gaussian classifier network input transformation speaker inputs bias shift variable shift words mode adaptation displaced points average error rates alphabet word recognition bridle states gaussian mixture output distribution details equivalent evaluation gaussian density simple network forward alpha computation likelihood data hidden markov model calculation thought performed recurrent network special form include bayes inversion produce probabilities classes assume equal prior probabilities obtain equivalent simple network figure call place linear transformation figure constrained linear transformation based spectrum amplitude frequency channel conditions bias parameters fixed shift parameters variable shift general case parameters figure shows average word error rates types transformation numbers utterances nonadaptive case mode check power transformations test speaker utterances parameters transformation recognition performance measured parameters utterances unsupervised adaptation reduced error rates reductions errors errors reduction error rate statistically significant practically significant performance mode unsupervised mode performance limited power transformation fixed shift transformation good results words time tested talker database isolated digits collected 
british unsupervised speaker adaptation technique gave decrease supervised unsupervised adaptation utterances simple frontend consisting coefficients sophisticated frontend differential information energy improved performance frontend frontend unsupervised adaptation technique utterances decreased conclusions results reported show simultaneous word recognition speaker made work improves performance corresponding speakerindependent version unknown words performance good adaptation knowledge word identities main extensions interested nonlinear transformations learn lowdimensional effective speaker unsupervised adaptation targets motivate target trick feeding back output network target suppose classifier network output coding softmax output write output input softmax output stage input network class parameters adjust typical output output values interpretable estimates posterior probabilities step assume implicit probability density functions assuming equal prior probabilities classes simplicity bayes rule suppose networks applies classes write maximumlikelihood approach unsupervised adaptation likelihood data equally probable distributions simpler maximise likelihood bridle likelihood training product likelihoods individual patterns turns product derivatives training independent giving supervised backprop network relative entropy based criterion squared error minimising target output minimising equivalent maximising simple gaussian network figure unsupervised adaptation applied reference points understood online gradient descent relative kmeans cluster analysis procedure vector design method kohonens feature neighbourhood constraints controller london references bridle recurrent neural network architecture hidden markov model interpretation speech communication special issue february bridle probabilistic interpretation feedforward classification network outputs relationships statistical pattern recognition editors neurocomputing algorithms architectures applications nato
series systems computer science springerverlag bridle training stochastic model recognition algorithms networks lead maximum mutual information estimation parameters advances neural information processing systems morgan kaufmann bridle simultaneous speaker utterance labelling techniques proc ieee conf acoustics speech signal processing speaker adaptation speech recognition acoust amer abstract database project technical report technology
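The unsupervised adaptation procedure in the paper above (classify each unlabelled token, take the class with the largest softmax output as a tentative target, and backpropagate to the speaker parameters only) can be illustrated with a toy version of the simple Gaussian network. The sketch below assumes one-dimensional inputs, unit variances, logits equal to minus half the squared distance to each reference point, and a single speaker parameter: a bias shift added to every input. The batch update and all names are hypothetical choices made for illustration, not the configuration used in the reported experiments.

```python
import math


def class_posteriors(z, reference_points):
    """Softmax over minus half the squared distance to each reference point."""
    logits = [-0.5 * (z - m) ** 2 for m in reference_points]
    peak = max(logits)  # subtract the maximum for numerical stability
    weights = [math.exp(value - peak) for value in logits]
    total = sum(weights)
    return [w / total for w in weights]


def adapt_shift(tokens, reference_points, rate=1.0, steps=100):
    """Adapt one speaker bias shift using targets derived from the outputs."""
    shift = 0.0  # speaker-independent default value
    for _ in range(steps):
        update = 0.0
        for x in tokens:
            p = class_posteriors(x + shift, reference_points)
            target = max(range(len(p)), key=lambda k: p[k])  # tentative label
            expected = sum(pk * m for pk, m in zip(p, reference_points))
            # Descent direction for -log p[target] with respect to the shift:
            # the gradient is (expected - m_target), so move the opposite way.
            update += reference_points[target] - expected
        shift += rate * update / len(tokens)
    return shift


# Hypothetical speaker whose tokens sit 0.6 above each class reference point.
reference_points = [0.0, 2.0, 4.0]
tokens = [0.6, 2.6, 4.6]
shift = adapt_shift(tokens, reference_points)
```

In this toy case the tentative labels are correct from the first pass, so the shift settles where the displaced tokens line up with the reference points. If the initial displacement were large enough to cause misclassifications, the same update could lock onto wrong labels, so the result should be read as hill-climbing from a good starting point rather than a guarantee.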
8 analog vlsi processor implementing continuous wavelet transform edwards cauwenberghs department electrical computer engineering johns hopkins university north charles street baltimore abstract present integrated analog processor realtime wavelet decomposition reconstruction continuous temporal signals covering audio frequency range processor performs complex harmonic modulation gaussian lowpass filtering parallel channels clocked rate producing multiresolution mapping logarithmic frequency scale implementation analog circuits techniques filters achieve wide linear dynamic range maintaining compact circuit size power consumption include experimental results processor characterize components separately measurements test chip introduction effective mathematical tool multiresolution analysis wavelet transform found widespread signal processing applications involving characteristic patterns cover multiple scales resolution representations speech vision wavelets offer suitable representations temporal data pertinent features time frequency domains wavelet decompositions effective representing signals neural systems present system performs continuous wavelet transform temporal onedimensional analog signals speech regard related silicon models cochlea implementing cochlear transforms multiresolution processor implemented expands architecture developed differs analog auditory processors signal components frequency band encoded signal modulated center multiplier figure systems multiplication multiplexing frequency channel subsequently lowpass filtered translating signal components center frequency frequency wavelet decomposition reconstruction analog continuoustime temporal data complex gaussian kernel formulae decomposition reconstruction center frequencies spaced logarithmic scale constant sets relative width frequency bins decomposition adjusted alter shape wavelet kernel successive decomposition reconstruction
transforms yield approximate identity operation exact continuous orthonormal basis function exists architecture operations implemented systems channel real component imaginary component phase takes form sinusoidal oscillating channel center frequency lowpass filter shown figure arrangement requires precise analog sine wave generator accurate linear analog multiplier present implementation circumvent requirements binary representation modulation reference signal multiplexing multiplying multiplication analog signal binary sequence naturally implemented high precision alternates presenting input inverse output principle applied simplify harmonic modulation illustrated figure multiplier replaced analog inverter controlled binary periodic sequence representing sine wave reference binary sequence chosen approximate analog sine wave closely components high frequency removed subsequent lowpass filter assumption made high frequency components present input signal edwards cauwenberghs sine input select filter sine wavelet filter reconstruction gaussian filter figure block diagram single channel wavelet processor showing test points modulation high frequency components binary sequence produce frequency distortion components output purpose additional lowpass filter added front residual lowfrequency distortion output minimized maximizing filters placing proper constraints cutoff frequencies optimally choosing sequence reference signal accuracy achieved improves length sequence extended constraints length implied overhead required signal bandwidth power dissipation complexity implementation wavelet gaussian function reason choosing gaussian kernel ensure optimal support time frequency requirement implementing gaussian filter linear phase avoid spectral distortion nonuniform group delays free architecture analog filter number taps required accommodate narrow bandwidth required prohibitively large purpose approximate gaussian filter firstorder lowpass filters probabilistic 
arguments obtained lowpass filter approximates gaussian filter increasingly number stages increases implementation sections wavelet processor parallel channels integrated single cmos technology sections configured perform wavelet decomposition reconstruction block diagram channels shown figure addition separate test chip designed performs channel wavelet function test points made points input output figure channel performs complex harmonic modulation gaussian lowpass filtering defined front chip section sample time multiplexed wavelet signals reconstruction cases signal decomposition reconstruction channel removes input component removed filters result lowpass filter result passes inverted signals output passed lowpass filter architecture remove high frequency components sequence passed gaussian shaped lowpass filter cutoff frequencies filters controlled clock rates analog vlsi processor implementing continuous wavelet transform figure remainder system reconstruction output multiplier multiplier implemented multiplexing scheme driven binary sequence representing sine wave sequence samples length created base sequence reversal inversion sequence length generates wave speech applications clock derived sequence lowpass filter form produces sine wave primary optimized base sequence consists zeros allowing simple implementation address decoder bits binary sequence shown figure magnitude prime harmonic sequence approximately unity process inverting sequence simplified gray code counter produce addresses sequence small amount combinatorial logic needed achieve desired result straightforward generate addresses cosine channel phase original linear filtering filters implemented linear firstorder filter sections number firstorder sections filters number sections gaussian filter producing suitable approximation gaussian filter response frequencies interest figure figure shows firstorder lowpass section filters implemented standard figure single discretetime lowpass filter 
section circuit implements transfer function single pole approx located laplace domain large values parameter sampling frequency parameter fixed design stage ratio capacitors figure filters gaussian filters measured results sine wave tested accuracy sine wave modulation signal applying constant voltages test points sine wave modulation signal effectively multiplied edwards cauwenberghs sine sequence filtered sine wave output binary sine sequence simulated filtered output measured output time figure filtered sine wave output constant output multiplier filtered output test point gaussian filter figure shows idealized output test point accurately creates desired binary sequence figure shows measured sine wave filtering filter expected output simulation model capacitor ratio justified analysis figure shown resulting sine wave good agreement simulation model provided correction made capacitor ratio account large parasitic capacitances measured data filter compared desired transform simulated output shown figure takes account smaller filter gaussian filter bandwidth output directly controlled proper gaussian filter distortion sine wave ultimately smaller measured output filter gaussian filter gaussian filter tested applying signal test point measuring response test point figure shows response gaussian filter compared expected responses sets curves filter clocked clocked curves normalized plotting time relative clock frequency solid line match lowpass filter capacitor ratio fitting parameter approximately lower capacitor area ratio chip dotted line response ideal gaussian characteristic approximated cascade firstorder sections capacitor ratio figure shows measured phase response gaussian filter clock phase response approximately linear region analog vlsi processor implementing continuous wavelet transform gaussian filter response chip data clock chip data clock filter ideal response gaussian filter ideal response frequency units theoretical phase measured response 
frequency units figure gaussian filter transfer functions theoretical actual relative amplitude phase wavelet decomposition figure shows test chip performing wavelet transform simple sinusoidal input illustrating effects sinusoidal modulation lowpass filtering gaussian filter chip multiplier system clocked input wave approximately close center frequency signal clock rate divided typical highest frequency channel auditory application trace figure shows filtered inverted input test point middle trace shows output test point output multiplexed signal inverse bottom trace system output labeled cosine figure shows signal frequency shown cosine output phase shown demonstrates proper operation complex single channel configured wavelet decomposition addition tested full chip decomposition individual parts function properly total power consumption wavelet chip measured large fraction attributed external circuitry periphery chip conclusions demonstrated full functionality analog chip performing continuous wavelet transform decomposition chip based mixed analogdigital signal processing principles scheme accurately implemented methods advantages architecture chip increased dynamic range precise control lateral synchronization wavelet components additional advantage inherent modulation scheme potential tune channel bandwidths wide range narrow bands cutoff frequency gaussian filter center frequency independently adjustable precisely controllable parameters references guide wavelets boston edwards analog wavelet transform chip ieee intl conf edwards cauwenberghs figure scope trace wavelet transform filtered input multiplexed signal middle wavelet output bottom neural networks edwards cauwenberghs architecture analog harmonic modulation electronics letters reading understanding continu wavelet transforms wavelets timefrequency methods phase space springer verlag andreou goldstein representation analog silicon model auditory periphery ieee edwards analog vlsi implementations tory 
wavelet transforms circuits ieee trans circuits september analog oscillator version techniques ieee trans circuits systems july lyon mead analog electronic cochlea ieee trans acoustics speech signal proc neural network adaptive wavelets signal representation classification optical engineering september watts lyon improved implementation silicon cochlea ieee journal solidstate circuits
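The wavelet-chip text above states that a cascade of first-order lowpass sections approximates a Gaussian filter increasingly well as the number of stages grows. The sketch below illustrates that claim numerically. It is only an illustration under stated assumptions: it uses ideal continuous-time first-order sections rather than the chip's discrete-time switched-capacitor sections, and the corner-frequency scaling (`sqrt(n) * w0`) and all function names are hypothetical choices, not taken from the paper.

```python
import math

def cascade_magnitude(x, n_sections):
    """Magnitude response of n identical first-order lowpass sections.

    x is the squared normalized frequency (w / w0)**2. Each section is
    assumed to have its corner at sqrt(n) * w0, so the overall bandwidth
    stays fixed while the number of sections grows.
    """
    return (1.0 + x / n_sections) ** (-n_sections / 2.0)

def gaussian_magnitude(x):
    """Ideal Gaussian lowpass magnitude exp(-w**2 / (2 * w0**2))."""
    return math.exp(-x / 2.0)

def max_deviation(n_sections, x_max=16.0, steps=1600):
    """Largest gap between cascade and Gaussian for 0 <= (w/w0)**2 <= x_max."""
    grid = (x_max * i / steps for i in range(steps + 1))
    return max(
        abs(cascade_magnitude(x, n_sections) - gaussian_magnitude(x))
        for x in grid
    )

# Compare cascades of increasing length against the Gaussian target.
errors = {n: max_deviation(n) for n in (1, 4, 16)}
for n, err in errors.items():
    print(f"{n:2d} sections: max deviation {err:.3f}")
```

The deviation shrinks as sections are added because (1 + x/n)^(-n/2) converges to exp(-x/2) as n grows, which is the limit behind the approximation described in the text.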
10 analog vlsi neural network phase based machine vision department electrical electronic engineering hong kong university science technology clear water hong kong suite place edward road west hong kong abstract describe design fabrication test results analog cmos vlsi neural network prototype chip intended machine vision algorithms chip implements image filtering operation similar gabor filters output complex valued define phase pixel image phase robust algorithms disparity estimation ocular stereo vergence control stereo vision image motion analysis chip reported takes input image generates outputs pixel real imaginary parts output introduction gabor filters preprocessing stages tasks machine vision image processing partially motivated findings dimensional gabor filters model receptive fields orientation selective neurons visual cortex dimensional spatiotemporal gabor filters model biological image motion analysis adelson gabor filter complex valued impulse response complex exponential modulated gaussian function dimension real constants angular frequency plex exponential standard deviation gaussian analog vlsi neural network machine vision phase complex valued filter output pixel related location edges features input image pixel translating image input results phase shift gabor output authors developed phase based approaches disparity estimation binocular vergence trol stereo vision image motion analysis barron comparison barron algorithms optical flow estimation algorithm accurate tested remainder paper describes design fabrication test results prototype analog vlsi continuous time neural network implements complex valued filter similar gabor network circuit architecture prototype implements cellular neural network architecture image filtering consists array neurons called cells corre sponding pixel image processed cell outputs evolve time equation real constants input image feed back neighbouring cells outputs enables information spread globally array network 
unique equilibrium point outputs correspond real imaginary parts result filtering image complex valued space convolution kernel approximated gaussian function gabor filter replaced larger narrower impulse response larger bandwidth figure shows real imaginary parts dotted lines show function modulates complex exponential figure real imaginary parts impulse response circuit implementation output corresponds voltage capacitor selected circuit architecture figure sensitive effects random parameter variations considered figure resistor labels denote conductances blocks represent amplifiers labelled gains figure circuit implementation neuron circuit implementation good intuitive understanding opera tion assume input image impulse pixel circuit corresponds setting current source setting remaining current sources gains conductances chosen steady state voltages lower capacitors follow spatial distribution shown figure center peak occurs cell voltages upper capacitors follow distribution shown figure arise circuit current supplied source part current flows conductance voltage positive voltage increases resistors conductance smoothing effect voltages current flows diagonal resistor conductance positive time transconductance amplifier input draws current node negative larger voltages nodes pushed negative positive hand larger greater smoothing nodes larger ratio higher spatial frequency impulse response oscillates design cmos building blocks section describes cmos transistor circuits implement transconductance amplifiers resistors figure implement capacitors equilibrium point unique parasitic capacitances sufficient ensure circuit operates correctly transconductance amplifier transconductance amplifiers implemented circuit shown figure output current approximately ratio differential pair transistors current assumed matched current analog vlsi neural network machine decreases static errors offsets caused finite output impedance transistors saturation bias figure cmos circuits 
implementing resistors resistors convolution kernels implemented modulated sine cosine functions voltages positive negative respect ground potential resistors circuit floating exhibit good linearity invariance common mode offsets voltages ground potential resistor circuits require bias circuitry implemented resistor image processing tasks interested maximizing number pixels processed bias circuitry cell decrease area turn increase number cells implementable area figure shows resistor circuit satisfies requirements circuit essentially cmos transmission gate adjustable gate voltages global bias generates gate voltages cmos resistor shown left gate bias voltages distributed resistor designed transistors operate conduction region nonlinear functions gate threshold voltages transistors chosen decrease effect nonlinearity terms conductance resistors adjusted limitations physical constraints circuit realizations values realized conductance values nonnegative gains positive nonnegative implies conductance nonnegative figure shows range center frequencies normalized relative band widths achievable realization bandwidths achievable figure filter parameters implementable circuit realization test results circuit architecture cmos building blocks fabricated orbit process mosis prototype cell dimensional array fabricated square fixed transistor smallest spatial frequency bandwidths obtained addition width impulse response adjustable changing externally supplied bias current shown figure controlling transconductance amplifiers resistors designed operate currents representing input image provided transconductance amplifiers internal chip controlled externally applied voltages outputs read chip analog form common amplifiers real part impulse response imaginary part outputs cells nected turn inputs amplifier transmission gates controlled shift chip requires supplies measure impulse response filters applied input correspond middle cell array remaining inputs output voltages chip 
function cell number shown solid lines figure correct offsets measured output voltages inputs shown dashed lines figure offsets separated components constant offset common cells array small offset varies cell cell chip shown constant offset approximately analog vlsi neural network machine figure measurements prototype small variations standard deviation results chips constant offset primarily offset voltage amplifier small variations cell cell result parameter variations cell cell offsets transconductance amplifiers cell subtracting offsets cell outputs observe impulse response closely matches predicted theory dotted lines figure show offset corrected outputs chip shown figure solid lines shows theoretical output chip parameters chosen minimize squared error theory data chip designed signal noise ratio defined energy theoretical output divided energy error theory data similar measurements chips gave signal noise ratios measure speed chips inputs middle cell attached function generator generating square wave switching rise times output chip cell measured ranged settling times increase number cells increases outputs computed parallel settling time primarily determined width impulse response wider impulse response farther information propagate array slower settling time conclusion architecture design test results analog vlsi prototype neural network filters images convolution kernels similar gabor filter future work chip design includes chips larger cells dimensional arrays chips integrated acquire process images simultaneously investigating network chips binocular vergence control active stereo vision system acknowledgements work supported hong kong research grants council grant number references adelson bergen spatiotemporal energy models perception motion optical society america barron performance optical flow techniques proc ieee twodimensional spectral analysis cortical receptive field profiles vision research vittoz analytical transistor model valid regions 
operation dedicated applications analog integrated circuits signal processing measurement image velocity boston kluwer academic publishers robustness implementations filtering proc conference circuits systems image filtering cellular neural networks proceedings ieee international symposium circuits systems binocular vergence control depth reconstruction active vision image understanding disparity estimation vision process springerverlag berlin
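The machine-vision text above describes the Gabor impulse response as a complex exponential modulated by a Gaussian, and notes that translating the input image produces a phase shift in the filter output. The following minimal sketch checks that relation for an ideal one-dimensional Gabor kernel applied to a sinusoidal grating. The values of `SIGMA` and `OMEGA`, the truncation width, and the function names are assumptions made for illustration. The chip itself implements a kernel that is only similar to a Gabor filter, so this is not a model of the circuit.

```python
import cmath
import math

SIGMA = 8.0                    # Gaussian standard deviation in pixels (assumed)
OMEGA = 0.5                    # angular frequency in rad/pixel (assumed)
HALF_WIDTH = int(4 * SIGMA)    # truncate the kernel at four standard deviations

def gabor(k):
    """Complex Gabor impulse response: Gaussian times complex exponential."""
    return math.exp(-k * k / (2.0 * SIGMA ** 2)) * cmath.exp(1j * OMEGA * k)

def gabor_response(image, pixel):
    """Convolve `image` (a function of pixel index) with the kernel at one pixel."""
    return sum(
        gabor(k) * image(pixel - k)
        for k in range(-HALF_WIDTH, HALF_WIDTH + 1)
    )

def grating(x):
    """Test image: a cosine grating at the filter's centre frequency."""
    return math.cos(OMEGA * x)

def shifted_grating(x, shift=2):
    """The same grating translated by `shift` pixels."""
    return math.cos(OMEGA * (x - shift))

r0 = gabor_response(grating, pixel=0)
r1 = gabor_response(shifted_grating, pixel=0)

# Phase of the shifted response relative to the original response.
phase_shift = cmath.phase(r1 / r0)
print(f"measured phase shift: {phase_shift:.4f} rad")
print(f"-OMEGA * shift:       {-OMEGA * 2:.4f} rad")
```

For a pure grating the phase change equals minus the angular frequency times the translation. For general image features the relation holds only approximately, which is why phase-based disparity methods treat it as an estimate.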
7 direct multistep time series prediction peter department electrical computer engineering university colorado boulder andreas weigend department computer science institute cognitive science university colorado boulder abstract paper explores application temporal difference learning sutton forecasting behavior dynamical systems real valued outputs opposed situations performance learning comparison standard supervised learning depends amount noise present data paper deterministic chaotic time series laser task direct ahead predictions experiments show standard supervised learning learning algorithm viewed linking adjacent predictions similar effect obtained sharing internal representation network compare architectures paradigms architecture separate hidden units consists individual networks direct multistep prediction tasks shared hidden units single larger hidden layer finds representation predictions steps generated data find significant difference architectures httpwww paper colors peter andreas weigend introduction santa time series prediction analysis competition weigend gershenfeld large number nonlinear techniques applied predic tion time series results neural networks neural networks poorly neural networks trained standard supervised learning network trained based differences predicted observed values series differences concerned architecture good time delay neural network architecture called finite impulse response network standard supervised learning hand views time series prediction essentially nonlinear regression fact dealing time series basically temporal difference learning hand takes approach adjusts parameters based differences successive predictions time sutton learning shown successful context games tesauro paper investigates paradigm applied task time series prediction paper organized briefly learning section focuses application learning multistep prediction time series contrasts supervised learning figure section describes architectures cost 
function data experiments section presents results section summarizes paper learning nonlinear direct multistep predictors idea learning errors gradient descent based predictions adjacent time contrast traditional approach errors based difference prediction observed general expression weight update rule linear case sutton learning rate adjacent predictions equivalent target weight gradient prediction time respect weights network equation present weights calculate predictions past weights calculate past gradients experiments output nonlinear connectionist network form propagating multilayer network hidden units backpropagate weight applying chain rule gradient respect hidden layer activation function variants exist forms gradients based present pair predictions continually adds gradients weighting general case weights past gradient weight shown subsequent section lead results optimal determined principles direct multistep time series prediction multistep prediction directly predict time series time steps future denotes observed past values time denoted observation vector cast multi step prediction problem framework form overlapping sequence predictions sutton ahead prediction problem form successive predictions target prediction time time series steps ahead time step form sets predictions based observation pair weight update time involves temporal difference equivalent predictions equation shows algorithm reduces algorithm predictions temporal structure revealed time actual observation pair time multistep prediction problem temporal structure exists observation vectors online differentiate algorithms figure depicts backpropagation errors learning algorithms errors generated squared difference predicted target values network training simply minimize error function based structural difference predicted simply target values figure figure learning minimizes error difference successive predictions note case noiseless time series expect difference performance learning 
actual values time series accurate systems output case noisy time series conjecture learning teaching signal simply noisy observable paper begin comparing performance learning noise deterministic time series supervised learning temporal difference learning figure backpropagation supervised temporal difference learning peter andreas weigend architecture data multistep prediction problem chose directly predict values time series past values comparing algorithms examine architectures compare performance realworld dataset architecture network architectures chosen compare algo rithms multistep prediction task separate hidden units architecture consists separate prediction networks forming single prediction prediction series outputs correspond predictions network input units past values time series hidden units arbitrarily chosen single linear output task shared hidden units architecture single network outputs predictions network inputs tanh hidden units cost function train squared error weighting predictions equally supervised learning case predictions compared actual values case errors calculated based successive predictions search network training batch updates update weights pass training data network training continues error crossvalidation begins increase networks trained learning weight ranges data laser data santa competition data intensity measurements laser chaotic state exhibiting dynamics competition data points training points cross validation model points testing depart competition rules order higher statistical significance results learning curves begin analysis plotting squared error normalized variance output units function training time data predictions edited weigend gershenfeld data anonymous analyses time series data sets direct multistep time series prediction learning algorithms figure case curves monotonically decreases individually fall rise plot shows tradeoffs multitask learning learned early levels learned learning case curves ordered order 
expect error smaller expected prediction driven prediction projected step future note curve similar paradigms error driven observed shared epochs shared epochs figure output versus training epochs training supervised learning typical runs shown architectures exhibit behavior learning rates varied training accelerate learning case error curves epochs decrease learning rate performance metric error compare performances normalized square number samples actual predicted values comparison learning prediction longer lead time larger expected difference performance focus predictions difference pronounced considered figure shows individual performances runs task direct predictions vary architecture left side shared hidden units side separate hidden units vary training large difference values significant difference result depends fact data noise main source noise quantization error analog digital converter normalized square errors start case small initial weights experiments reason large initial weights drawn uniform distribution peter andreas weigend shared hidden units separate hidden units figure direct prediction architectures test performance comparison single task multitask learning task ahead prediction wanted investigate predicting tasks versus predicting single task beneficial comparing column left side figure shows significant difference eliminate hypotheses performance limited number hidden units networks single output unit task allocated hidden units performance remains fact additional tasks hurt performance networks sufficient resources fact additional tasks performance noise data problems weigend huberman rumelhart multitask predictions exchange rates breiman friedman caruana weigend discussions multitask learning hidden units figure summary test performance architectures learning figure table summarize performance networks stated earlier networks trained outperformed networks trained earlier predictions significant performance exist architecture learning 
algorithm note results networks equivalent separate hidden unit network error function equivalent algorithms networks trained learning networks exhibit average performance direct multistep time series prediction prediction shared separate table summary test performance percent empirical standard deviation averaged shared error separate error figure errors versus actual errors architectures actual versus temporal difference errors temporal difference learning rule based error neighboring predictions time question arise actual errors predicted observed values vary respect errors adjacent predictions target training figure plots actual errors versus errors architectures architectures errors smaller actual errors figure architectures prediction errors influence predictions data sets curves figure expect upward slope training signaling overfitting begun conclusions explored application temporal difference learning forecasting realvalued time series opposed situations relating learning supervised peter andreas weigend learning general perspective compare analyze performance paradigms specific data deterministic chaotic laser data santa competition time series data find paradigm learning curves individual outputs depend specific architecture shared separate hidden units paradigms learning curves show larger error tradeoff individual outputs learning longest lead time considered ahead predictions difference pronounced outperforms giving network additional tasks predicting ahead intermediate steps shared hidden units change performance compared single output separate hidden units choice weight appears range plotting error versus actual error diagnostic outofsample data noisy problems performance learning comparison depends amount noise present data noise time series paper advantage learning present comparing paradigms noisy realworld data overfitting challenge acknowledgments richard sutton suggestions implementation andreas weigend acknowledges support national science 
foundation research grant references breiman friedman multiple outputs abstract neural networks computing snowbird april multitask connectionist learning proceedings connectionist models summer school edited mozer smolensky touretzky elman weigend hillsdale erlbaum associates weigend learning local error bars nonlinear regression advances neural information processing systems volume francisco morgan kaufmann sutton learning predict methods temporal differences machine learning tesauro issues temporal difference learning machine learning weigend gershenfeld time series prediction forecasting future understanding past reading addisonwesley weigend huberman rumelhart predicting exchange rates connectionist networks nonlinear modeling forecasting edited city addisonwesley
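The time-series text above contrasts temporal difference learning, whose errors are differences between successive predictions, with supervised learning, whose errors are differences between predictions and observed values. It also refers to Sutton's weight update rule for the linear case. The sketch below implements that linear TD(lambda) rule for a single sequence and checks the known special case that lambda = 1 gives the same accumulated update as the supervised rule. It is a simplified illustration, not the paper's network code: the linear predictor, the toy inputs, and all function names are assumptions.

```python
def predict(w, x):
    """Linear predictor P = w . x (its gradient with respect to w is x)."""
    return sum(wi * xi for wi, xi in zip(w, x))

def td_lambda_update(xs, target, w, alpha, lam):
    """Accumulated TD(lambda) weight change over one sequence.

    xs is the sequence of observation vectors, `target` the final outcome.
    Weights are held fixed during the sequence (batch-style update).
    """
    preds = [predict(w, x) for x in xs] + [target]
    trace = [0.0] * len(w)   # exponentially weighted sum of past gradients
    delta = [0.0] * len(w)
    for t, x in enumerate(xs):
        trace = [lam * e + xi for e, xi in zip(trace, x)]
        td_error = preds[t + 1] - preds[t]   # difference of adjacent predictions
        delta = [d + alpha * td_error * e for d, e in zip(delta, trace)]
    return delta

def supervised_update(xs, target, w, alpha):
    """Accumulated supervised (delta-rule) weight change over the same sequence."""
    delta = [0.0] * len(w)
    for x in xs:
        err = target - predict(w, x)         # prediction versus observed value
        delta = [d + alpha * err * xi for d, xi in zip(delta, x)]
    return delta

xs = [[1.0, 0.2], [0.5, -0.3], [0.1, 0.9]]   # toy observation vectors
w = [0.4, -0.1]

td1 = td_lambda_update(xs, target=1.0, w=w, alpha=0.1, lam=1.0)
td0 = td_lambda_update(xs, target=1.0, w=w, alpha=0.1, lam=0.0)
sup = supervised_update(xs, target=1.0, w=w, alpha=0.1)
print("TD(1):     ", td1)
print("supervised:", sup)
print("TD(0):     ", td0)
```

With lambda = 1 the temporal differences telescope into prediction-minus-target errors, so the two updates coincide. Smaller values of lambda weight past gradients less and give a different update.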
7 bayesian query construction neural network models german national research center computer science germany abstract data collection costly gained actively informative data points sequential bayesian decisiontheoretic framework develop query selec tion criterion explicitly takes account intended model predictions markov chain monte carlo methods quantities approximated desired sion number data points grows model complexity modified bayesian model selection strategy proper ties versions criterion demonstrated numerical experiments introduction paper situation data collection costly real measurements technical experiments performed situation approach query learning active data selection sequential experimental design potential benefit depending previously examples input query selected systematic output obtained motivation query learning random examples redundant information concentration examples necessarily improve generalization performance bayesian decisiontheoretic framework derive criterion query struction criterion reflects intended predictions loss function limit analysis selection data point data sampled proposed procedure derives expected loss candidate inputs selects query minimal expected loss published query construction methods plutowski white sollich current approaches cohn rely information matrix parameters parameters receive equal attention influence intended model walter addition estimates valid asymptotically sian approaches berger applied neural networks mackay sollich saad relation maximum information gain discussed paper show markov chain monte carlo determine quantities selection query approach valid small sample situations procedures precision increased additional computational effort square loss function criterion reduced variant familiar integrated square error plutowski white section develop query selection criterion decisiontheoretic point view section show criterion calculated markov chain monte carlo methods discuss strategy model 
selection section results experiments mlps decisiontheoretic framework assume input vector scalar output distributed vector parameters conditional expected deterministic function error term suppose iteratively collected observations bayesian posterior predictive distribution prior distribution situation based data perform action result depends unknown output decisions severe effects loss function measures loss true action paper realvalued actions setting temperature chemical process select knowing input bayes principle berger follow decision rule average risk minimal risk defined distribution future inputs assumed square loss function conditional expectation optimal decision rule control problem loss larger specific critical points addressed square loss function berger expected loss action placing predictive density weighted predictive density bayesian query construction neural network models optimal decision rule average loss input derivations square loss function applied weighted square loss query sampling selection observation average risk maximally reduced unknown defines observation data determine risk perform conceptual steps candidate query future data construct sets future observations future posterior determine future posterior distribution parameters depends observed future loss assuming optimal decision rule values compute resulting loss averaging integrate quantity future trial inputs distributed future outputs yielding procedure repeated minimal average risk found local optima typical global optimization method required subsequently determine current model adequate increase complexity adding hidden units computational procedure assume real data generated regression model gaussian noise multilayer perceptton radial basis function network error terms independent posterior density case query sampling analytic derivation posterior infeasible trivial cases approximations approach employ normal approximation mackay unreliable number observations small number 
parameters markov chain monte carlo procedures neal generate sample parameters distri number sampling steps approaches infinity distribution simulated approximates posterior arbitrarily account range future create number generated resulting performing markov monte carlo generate sample parameters importance sampling approximate integrals function respect approximation error approaches size increases approximation future loss future loss observation trial input case square loss function transformed independent assume representative trial inputs distribution define equations final obtained averaging trial inputs reduce variance trial inputs selected importance sampling concentrate regions high current loss facilitate search minimal reduce extent random fluctuations values vector random numbers randomly selected observations defined difference neighboring inputs affected noise search procedures exploit gradients current loss future loss current loss bayesian query construction neural network models weights inputs relevance square loss function average loss conditional variance vary sample representative approximate current loss input distribution uniform term independent complexity regularization neural network models represent arbitrary mappings finitedimensional spaces number hidden units sufficiently large hornik stinchcombe number observations grows hidden units details mapping sequential proce increase capacity networks query learning white call approach method provide results consistency white bayesian approaches model selection prove case models model choice ratio popular bayes factors choose full model data show pseudo bayes factor bayesian variant crossvalidation affected difference small full posterior importance function numerical demonstration experiment tested approach small target func tion gaussian noise assumed square loss function uniform input distribution true architecture approximating model started single randomly generated observation figure 
future loss exploration predicted posterior future loss current loss observations left root square error prediction estimated future loss inputs selected input smallest future loss query parameter vectors generated metropolis steps simultaneously approximated current loss criterion left side figure shows typical relation measures situations future loss regions current loss posterior standard deviation prediction high queries areas high variation estimated posterior approximates target function part figure rmse prediction averaged independent experiments shown observations rmse drops sharply marked difference prediction errors resulting future loss current loss criterion averaged experiments substantial computing effort favors current loss criterion dots rmse randomly generated data averaged experi ments bayesian prediction procedure data points located critical region high variation rmse larger experiment defined target function gaussian noise standard deviation added shown left part figure mlps hidden units candidate models generated samples posterior current data started metropolis steps small values increased metropolis steps larger values network hidden units observations metropolis steps seconds workstation equation compare models optimal model calculate current loss regular grid query points assumed square loss function uniform input distribution selected query point maximal current loss determined final query point hillclimbing algorithm close true global optimum main result experiment summarized part figure bayesian query construction neural network models observations figure current loss exploration target function root square error shows averaged experiments root square error true posterior grid inputs relation sample size phases exploration distinguished figure beginning search performed queries border input area observations algorithm detail true function concentrate relevant parts input space leads marked reduction square error observations systematic 
part true function captured perfectly phase experiment algorithm reduces uncertainty caused random noise contrast data generated randomly sufficient information details error gradually decreases space constraints report experiments radial basis functions similar results acknowledgement work part joint project reflex german department science technology grant number alexander linden mark ring frank fruitful discussions references berger berger statistical decision theory foundations concepts methods springer verlag york neural network exploration optimal experimental design cowan nips morgan kaufmann mateo recent advances linear design bayesian model choice asymptotics exact calculations royal statistical society figure current loss upper absolute deviation true function lower observations dots hornik stinchcombe hornik stinchcombe multilayer feedforward networks universal approximators neural networks monte carlo methods wiley york mackay mackay objective functions active data selection neural computation neal neal probabilistic inference markov chain monte carlo methods tech report computer science univ toronto order probabilities uncertain conflicting dence uncertainty artificial intelligence elsevier amsterdam white plutowski white selecting concise training sets clean data ieee neural networks walter walter bayesian experimen design response optimization miller model oriented physica verlag heidelberg sollich sollich query construction entropy generalization neural network models physical review sollich saad sollich saad learning queries maximum infor mation gain problems volume white white results estima tion dependent observations nonparametric semiparametric methods statistics york cambridge univ press
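The query-construction text above describes generating parameter samples from the posterior with Markov chain Monte Carlo and, under square loss, using the posterior variance of the prediction (the current loss) to pick the next query on a grid of candidate inputs. The sketch below walks through those steps on a toy problem. It rests on several assumptions that are not in the paper: a two-parameter linear model stands in for the neural network, the noise level and prior width are fixed by hand, a plain random-walk Metropolis sampler is used, and all names and data values are hypothetical. The more expensive future-loss criterion is not shown.

```python
import math
import random

random.seed(0)

NOISE_SD = 0.1   # assumed known observation noise
PRIOR_SD = 2.0   # assumed Gaussian prior width on both parameters

def model(w, x):
    """Toy regression model standing in for a neural network."""
    return w[0] + w[1] * x

def log_posterior(w, data):
    """Unnormalized log posterior: Gaussian likelihood plus Gaussian prior."""
    log_lik = -0.5 * sum((y - model(w, x)) ** 2 for x, y in data) / NOISE_SD ** 2
    log_prior = -0.5 * (w[0] ** 2 + w[1] ** 2) / PRIOR_SD ** 2
    return log_lik + log_prior

def metropolis(data, steps=4000, burn_in=1000, step_sd=0.1):
    """Random-walk Metropolis sampler over the two model parameters."""
    w = [0.0, 0.0]
    lp = log_posterior(w, data)
    samples = []
    for i in range(steps):
        cand = [wi + random.gauss(0.0, step_sd) for wi in w]
        lp_cand = log_posterior(cand, data)
        # Accept with probability min(1, posterior ratio).
        if random.random() < math.exp(min(0.0, lp_cand - lp)):
            w, lp = cand, lp_cand
        if i >= burn_in:
            samples.append(w)
    return samples

def current_loss(samples, x):
    """Posterior variance of the prediction at input x (square loss)."""
    preds = [model(w, x) for w in samples]
    mean = sum(preds) / len(preds)
    return sum((p - mean) ** 2 for p in preds) / len(preds)

# A few observations clustered in a small part of the input range.
data = [(0.0, 0.05), (0.1, 0.12), (0.2, 0.18)]
samples = metropolis(data)

# Evaluate the criterion on a regular grid and query where it is largest.
grid = [i / 10.0 for i in range(-10, 11)]
query = max(grid, key=lambda x: current_loss(samples, x))
print("next query point:", query)
```

For this linear toy model the predictive variance is a convex quadratic in the input, so the selected query lies on the border of the input range. That matches the text's observation that early queries fall on the border of the input area, although here it follows from the simplified model rather than from a network.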
3 distributed structure processing legendre department linguistics miyata computing systems center university colorado boulder paul smolensky department computer science abstract harmonic grammar legendre connectionist theory wellformedness based assumption wellformedness sentence measured harmony negative energy connectionist state assuming lowerlevel connectionist network obeys general connectionist principles unspecified construct higherlevel network equivalent function captures relevant global aspects lower level network paper extend tensor product representation smolensky fully recursive representations structured objects sentences lowerlevel network show theoretically power technique parallel distributed structure processing introduction technique presented representing recursive structures connectionist networks developed context framework harmonic grammar legendre formalism theories linguistic wellformedness involves basic levels lower level elements problem domain represented distributed patterns activity network higher level elements domain represented locally connection weights interpreted soft rules involving elements aspects central framework authors listed order legendre miyata smolensky connectionist wellformedness measure harmony negative energy model linguistic wellformedness properties lower higher levels maximized network processing previous work developed techniques deriving higher level linguistic data allowed make contact higherlevel analyses linguistic phenomenon paper concentrates aspect framework linguistic structures sentences efficiently represented processed lower level section describes method representing tree structures network extension tensor product representation proposed smolensky recursive tree structures represented tree operations performed parallel recursive tensor product representations tensor product representation structures assigns vector built representations constituents role decomposition specifies constituent
structure assigning bindings strings alphabet choose role decomposition roles absolute positions string constituents bindings tensor product representation constituent binding represented tensor generalized outer product vectors representing filler role isolation represented vector fact tensor elements conveniently labelled subscripts defined simply filler role vectors straightforward case filler member simple alphabet role member simple designer representation simply specifies vectors representing elements complex cases sets sets structures turn viewed constituents turn represented tensor product representation recursive construction tensor product representations leads tensor products vectors creating rank higher elements conveniently labelled subscripts recursive structure trees leads naturally recursive construction tensor product representation analysis builds section smolensky binary trees node children techniques developed generalize immediately trees higher branching factor power binary trees success lisp basic binary tree adopting notations lisp assume simplicity terminal nodes major kind role decomposition considered smolensky roles decomposition constituent role preceded distributed recursive structure processing tree children terminal nodes labelled symbols atoms structures represent union atoms binary trees terminal nodes labelled atoms view binary tree analogy viewed strings large number positions locations relative root adopt positional roles labelled binary strings vectors position tree accessed left child child child left child root tree role decomposition constituent tree filler bound role location tree atoms respective locations vector representing recursire view binary tree sees constituents atoms subtrees left children root fully recursire role decomposition fillers atoms trees fillers original structures fully recursire role decomposition incorporated tensor product framework making vector spaces operations complex smolensky goal 
representation obeying tree left subtree subtree vectors representing roles recursive decomposition left children root roles represented vectors fully recursire representation obeying equation constructed positional representation assuming positional role vectors constructed recursively fully recursire role vectors vectors representing positions depth tree rank taking root depth tree represented accordance equation complication vector spaces needed accomplish recursire analysis ranks representing depths tree direct spaces rank effect long vector elements adopting definition essentially taking recursire structure implicit subscripts labeling positional role vectors mapping structure vectors legendre miyata smolensky depth depth represented tensor depth represented tree represented sequence tensor depth depths denote vector space sequences rank rank depth infinite elements added superimposed simply adding rank vector space representing trees vector operation building representation tree subtrees equation operation written denotes vector space representing atoms terms matrices multiplying vectors written equation nonzero elements matrix replacing taking tree extracting left child recursive decomposition shown smolensky section role vectors independent performed accurately operation specifically product tensor contraction vector representing tree vector general vectors dual basis role vectors equivalently vectors comprising inverse role vectors role vectors orthonormal discussed vectors role vectors operation written operation element tensor sequence report atoms represented binary vectors space representing portion depth total units depths tensor product representations exact representation embedded cheap distributed recursive structure processing replacing operation realized matrix mapping nonzero elements matrix replaced main points developing connectionist representation trees enable massively parallel processing traditional sequential implementation lisp symbol 
processing consists long sequence operations compose sequence operations single matrix operation adding minimal nonlinearity compose complex operations incorporating equivalent conditional branching illustrate simple motivated symbol manipulation problem transforming tree resentation syntactic parse english sentence tree representation expression meaning sentence considered syntactic structures simple active sentences form passive sentences form transformed tree represent patient verb arbitrarily complex noun phrase trees network handle arbitrarily complex marker passive network presented input tree type represented activation vector fully recursive tensor product representation developed preceding section nonzero binary vectors length coded atoms role vectors technique desired output representation tree representing filler vectors verb constituent words noun phrases roles input tree bound roles output tree transformation performed active sentence operation input tree passive sentence cars operations implemented network weight matrices connecting input units output units shown figure addition network circuit case orthonormal similarly weight matrices constructed basic matrices legendre miyata smolensky output active input figure recursive tensor product network processing passive sentence determining input sentence active passive simply computed weight matrix input tree passive sentence marker gated sigmapi connections gated setting network process arbitrary input sentences type depth limited size network properly generated correct case role assignments figure shows network processing passive sentence generating output discussion formalism developed recursive representation trees generates representations depending choice fundamental role vectors vectors representing atoms extreme trivial fully local representation connectionist unit dedicated position special case chosen canonical basis vectors vectors representing atoms chosen canonical basis vectors previous 
section illustrated case linearly dependent vectors atoms orthonormal vectors roles distributed elements vectors nonzero property permits representation atoms vectors usual notions symbolic computation letting similar atoms represented vectors closer dissimilar contributes savings units purely local case literal rotation role space distributed recursive structure processing demonstrate fully distributed representations capable fully local supporting massively parallel structure processing point local representations claimed connectionist implementations preserve structure representations symbolic structures capable true processing case illustrated distributed sense units corre sponding depth tree involved representation atoms depth depths separate formalism network allowing role vectors linearly dependent full accuracy generality structure processing representation greater depth fewer units case subject current research space limitations prevented describing preliminary results harmonic grammar question developed fully recursire tensor product representation lowerlevel representation embedded structures ubiquitous syntax implications measured harmony function approximation natural language case captured context free grammars subtree independent level embedding turns wellformedness captured simple equation governing harmony function weight matrix higher level grammatical rules harmonic grammar consequence numerical constant appearing soft constraint constitutes rule applies levels embedding greatly constrains parameters grammar references connectionism problem system solution doesnt work cognition connectionism cognitive architecture critical analysis cognition legendre miyata smolensky harmonic grammar formal multilevel connectionist theory linguistic wellformedness theoretical proceedings meeting cognitive science legendre miyata smolensky harmonic grammar formal multi level connectionist theory linguistic wellformedness application proceedings meeting cognitive 
science smolensky tensor product variable binding representation structures connectionist networks artificial intelligence
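The recursive tensor product construction described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' network: it assumes orthonormal role vectors `r0` and `r1` (so each role vector is its own dual vector), stores the direct sum over depths as a dictionary keyed by depth, and uses three-element binary atom vectors chosen only for the example. The function names `atom`, `cons` and `extract` are hypothetical.

```python
import numpy as np

def atom(vec):
    """A tree that is a single atom at the root (depth 0)."""
    return {0: np.asarray(vec, dtype=float)}

def cons(p, q, r0, r1):
    """Build a tree with left child p and right child q: p (x) r0 + q (x) r1."""
    out = {}
    for child, role in ((p, r0), (q, r1)):
        for depth, tensor in child.items():
            bound = np.multiply.outer(tensor, role)  # append one role axis
            out[depth + 1] = out.get(depth + 1, 0.0) + bound
    return out

def extract(s, u):
    """Unbind a child by contracting the root-level role axis with the dual vector u."""
    return {depth - 1: tensor @ u for depth, tensor in s.items() if depth >= 1}

# Assumed role vectors: a rotated orthonormal pair, so both are distributed.
theta = 0.3
r0 = np.array([np.cos(theta), np.sin(theta)])
r1 = np.array([-np.sin(theta), np.cos(theta)])

# Assumed atom vectors: linearly dependent binary patterns.
A = atom([1, 1, 0])
B = atom([0, 1, 1])
C = atom([1, 0, 1])

tree = cons(A, cons(B, C, r0, r1), r0, r1)      # the tree (A . (B . C))
left = extract(tree, r0)                        # recovers A at depth 0
right_left = extract(extract(tree, r1), r0)     # recovers B at depth 0
```

Because `cons` and `extract` are both linear, any fixed sequence of them composes into a single linear map, which is the property the text relies on for parallel structure processing.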
9 regression inputdependent noise bayesian treatment christopher bishop neural computing research group aston university birmingham aston abstract regression problem assumed distribution target data deterministic function inputs additive gaussian noise constant variance maximum likelihood train models corresponds minimization error function applications realistic model noise variance depend input variables maximum likelihood train models give highly biased results paper show bayesian treatment inputdependent variance coming bias maximum likelihood introduction regression problems important predict output variables estimate error bars predictions important contribution error bars arises intrinsic noise data conventional regression assumed noise modelled gaussian distribution constant variance applications realistic noise variance depend input variables general framework modelling conditional probability density function target data input vector introduced form mixture density networks bishop feed forward network parameters mixture kernel distribution jacobs special case single isotropic gaussian kernel function bishop discussed weigend generalization arbitrary covariance matrix williams approaches based maximum likelihood lead noise variance systematically adopt approximate hierarchical bayesian treatment mackay find probable interpolant probable inputdependent noise variance results maximum likelihood show bayesian approach leads significantly reduced bias order gain insight limitations maximum likelihood proach limitations overcome bayesian treatment simpler problem involving single random variable bishop suppose variable gaussian distribution unknown unknown variance sample drawn distribution goal infer values variance likelihood function approach finding variance maximize likelihood jointly intuitive idea finding parameter values rise observed data yields standard result estimate variance biased expectation estimate equal true true variance distribution 
generated data denotes average data sets size large effect small case regression problems generally larger number degrees freedom relation number data points case effect bias substantial problem bias regarded maximum likelihood approach estimated data fitted noise data leads variance true expression maximum likelihood expression estimate unbiased adopting bayesian viewpoint bias removed marginal likelihood computed integrating assuming fiat prior obtain regression inputdependent noise bayesian treatment maximizing respect unbiased result illustrated figure shows contours marginal likelihood conditional likelihood eval likelihood figure left hand plot shows contours likelihood function data points drawn gaussian distribution unit variance hand plot shows marginal likelihood function dashed curve conditional likelihood function solid curve contours result maximizes smaller maximizes bayesian regression regression problem involving prediction noisy variable vector input variables goal predict regression function inputdependent noise variance networks network takes input vector generates output simplicity single output variable extension work multiple outputs straightforward bishop represents regression function governed vector weight parameters network takes input vector generates output representing inverse variance noise distribution governed vector weight parameters conditional distribution target data input vector modelled normal distribution obtain likelihood function data simplification subsequent analysis obtained taking regression function linear combinations fixed basis functions mackay choose basis function network constant weights represent bias parameters maximum likelihood procedure chooses values finding joint imum give biased result regression function fits part noise data leading overestimate extreme cases regression curve passes data point estimate infinity estimated noise variance solution problem section suggested context mackay chapter order obtain 
unbiased estimate find marginal distribution integrated dependence leads hierarchical bayesian analysis begin defining priors parameters isotropic gaussian priors form hyperparameters stage hierarchy assume fixed probable determined shortly probable denoted found maxi regression inputdependent noise bayesian treatment posterior distribution denominator taking negative dropping constant terms obtained minimizing choice model represents linear problem easily solved standard matrix techniques level hierarchy find maximizing marginal posterior distribution term denominator found integrating model prior integral gaussian performed analytically approximation taking discarding constants minimize denotes determinant hessian matrix unit matrix function minimized standard nonlinear optimization algorithms scaled conjugate gradients derivatives easily found terms eigenvalues summary algorithm requires outer loop probable found nonlinear minimization scaled conjugate dient algorithm time optimization code requires gradient optimum found minimizing effect evolving fast timescale slow time scale maximum penalized likelihood approach consists joint nonlinear optimization posterior distribution obtained finally hyperparameters fixed maximum likelihood bayesian approaches treated equal result dependent choice maximum distribution invariant change variable results discussion bishop illustration algorithm problem involving input output noise variance dependence input variable estimated quantities noisy finite data averaging procedure generate independent data sets consisting data points model trained data sets turn tested remaining data sets networks gaussian basis functions bias width parameters chosen equal spacing centres results shown figure clear maximum likelihood results biased noise variance systematically contrast maximum likelihood bayesian maximum likelihood bayesian figure left hand plots show sinusoidal function dashed curve data generated regression function averaged 
training sets hand plots show true noise variance dashed curve estimated noise variance averaged data sets bayesian results show improved estimate noise variance evaluating likelihood test data distributions bayesian approach likelihood data point averaged runs overfitting problem maximum likelihood occasionally extremely large negative values likelihood estimated large regression curve passes close individual data point omitting extreme maximum likelihood average likelihood data point substantially smaller bayesian result exploring markov chain monte carlo methods neal perform integrations required bayesian analysis numerically introduce gaussian approximation evidence framework recently mackay proposed alternative technique based gibbs sampling interesting compare approaches acknowledgements work supported epsrc grant validation verification neural network systems references bishop mixture density networks technical report neural computing research group aston university bishop neural networks pattern recognition oxford university press jacobs jordan nowlan hinton adaptive mixtures local experts neural computation mackay bayesian methods adaptive models california institute technology mackay probabilistic networks models methods proceedings international conference artificial neural networks paris neal probabilistic inference markov chain monte carlo methods technical report department computer science university toronto weigend learning local error bars nonlinear regression tesauro touretzky leen advances neural information processing systems volume cambridge press williams neural networks model conditional multivariate densities neural computation
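The single-variable argument used above to motivate the Bayesian treatment can be checked numerically. The sketch below is an illustration under stated assumptions, not the paper's regression experiment: it draws many small Gaussian data sets and compares the joint maximum likelihood variance estimate (divide by N) with the estimate obtained after marginalizing the mean under a flat prior (divide by N - 1). The function name and the default sample sizes are hypothetical choices.

```python
import numpy as np

def variance_estimates(n_points=5, n_datasets=20000, true_var=1.0, seed=0):
    """Average the two variance estimators over many independent data sets."""
    rng = np.random.default_rng(seed)
    x = rng.normal(0.0, np.sqrt(true_var), size=(n_datasets, n_points))
    ml = x.var(axis=1, ddof=0)        # joint maximum likelihood: divides by N
    marginal = x.var(axis=1, ddof=1)  # mean integrated out: divides by N - 1
    return ml.mean(), marginal.mean()

ml_mean, marginal_mean = variance_estimates()
print("maximum likelihood:", ml_mean)
print("marginal likelihood:", marginal_mean)
```

With N data points the maximum likelihood average should sit near (N - 1)/N times the true variance, which is the underestimate the text attributes to fitting part of the noise.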
4 segmentation circuits optimization constrained john harris technology cambridge abstract segmentation algorithm developed utilizing absolute penalty common quadratic functional piecewise constant constraint segmented data energy guaranteed problems local complex continuation methods find unique global minimum interpret minimized energy generalized power nonlinear resistive network continuoustime analog segmentation circuit constructed introduction analog hardware obvious advantages terms size speed cost power consumption analog chip feel constrained ping existing digital algorithms silicon times algorithms adapted ensure analog hardware analog algorithms embedded hardware simple obey natural constraints physics algorithm intuition gained continuoustime nonlinear systems algorithm paper experimentation existing analog segmentation hard ware surprisingly analog algorithms prove computer vision limited simulating analog hardware digital computer portion work part dissertation harris smoothness term deal systems stable states network unique stable state studied network minimizes fimction smoothness penalty tile familiar quadratic term intuitive reasons penalty improvement quadratic penalty piecewise constant mentation large values penalty severe means edges smoothed small values penalized quadratic case resulting surface edges complex continuation annealing methods avoid local minima computational model interest vision researchers independent hardware implications method similar constrained methods discussed platt interpretation problem minimize constraint equation instance penalty method constraint fulfilled penalty function constraint fulfilled finite unlike typical constrained optimization methods application requires exact constraints fail discontinuities fulfilled algorithm resembles techniques robust statistics field formalized huber robust estimation techniques visual cessing clear single outlier wild variations standard regular ization networks rely quadratic 
data constraints quadratic data constraints robust techniques tend limit influence outlier data point function method commonly reduce outlier fact tile network developed paper robust method discontinuities data interpreted line process resistive fuse networks interpreted robust methods complex influence functions analog models pointed poggio koch notion minimizing power linear networks implementing quadratic regularized algorithms replaced general notion minimizing total resistor nonlinear networks resistor characterized content defined detection techniques mapped analog hardware segmentation circuits constrained optimization nonlinear resistive network piecewise const segmentation onedimensional interpolation froin dense data model problem paper techniques generalize sparse data multiple dimensions standard technique smoothing interpolating noisy inputs minimize energy form term ensures solution close data term implements smoothness constraint parameter controls tradeoff degree smoothness fidelity data equation interpreted regularization method power linear version resistive network shown figure energy equation discontinuities numerous starting geman geman modified equation line processes successfully demonstrated piecewise smooth segmentation methods resultant energy nonconvex complex annealing methods required converge good local minima energy space problem solved probabilistic deterministic annealing techniques discontinuities successfully demonstrated analog hardware resistive fuse networks continuation methods required find good solution term energy paper cost functional necessarily relate true energy real world harris original image figure examples network simulation varying characteristic saturating resistors shows synthetic image additive gaussian noise input network network outputs shown figures simulations segmentation circuits constrained optimization figure tiny tanh circuit saturating tanh characteristic measured nodes controls conductance saturation 
voltage device linear resistor half power functional equation strictly convex function origin hardware software methods solution oscillations approx equation convex cost function equation derivative equation yields current equation node resistive network figure tanh construction network requires nonlinear resistor hyper tangent characteristic extremely narrow linear region harris reason element called resistor saturating resistor nonlinear element resistive network shown figure charac wellknown circuit made independent voltage resistors strictly increasing characteristics stable state computer simulations figure shows synthetic image additive gaussian noise shows simulated result mead observed network saturating resistors limited segmentation effect noise evident output curves side step started slope increased smooth noise sides step blend homogeneous region width linear region resistor reduced network segmentation properties greatly enhanced segmentation performance improves shown figure improves figure segmentation occurs curve resembles step approximates decreasing shows change output drawback network recover exact heights input steps constant fiom height input ward show amount uniform region background significant features large retain original height noise points small ratios background typically exact values heights important location discontinuities difficult construct twostage network recover exact values step heights desired scheme network control switches fuse network analog implementation mead constructed cmos saturating resistor characteristic form delta larger mental physical limitations simulation results section suggest height segmented order network saturating resistor segment order large voltage input chips typically interested segmenting images levels higher voltages required circuit shown figure builds version saturating resistor gain stage decrease linear region device device made saturate voltages smooth segment noisy depth data 
correlationbased stereo algorithm real images segmentation circuits constrained optimization chip input segmented step figure measured performance network step input shown left step output shown segmented step height implementing nonlinear resistors figure circuit network successfully fabricated figure shows segmentation resulted step scanned chip segmented step reduced special annealing methods convex energy minimized conclusion energy functional developed piecewise constant segmentation computational model interest vision researchers independent hardware implications convex energy minimized sharp contrast previous solutions problem complex continuation annealing methods avoid local minima interpreting lyapunov energy nonlinear circuit built demonstrated network continuoustime segmentation network analog vlsi acknowledgement work perform caltech support christof koch carver mead hughes aircraft graduate student fellowship fellowship gratefully acknowledged work extended piecewise linear regions purely piecewise constant processing discussed paper harris references poggio torre illposed problems early vision proc ieee blake visual press cambridge geman geman stochastic relaxation gibbs distribution bayesian restoration images ieee murray wright academic press harris koch twodimensional analog vlsi circuit detecting discontinuities early vision science harris koch wyatt resistive analog hardware detecting discontinuities early vision mead editor analog vlsi neural kluwer harris models early vision thesis california institute technology pasadena dept computation neural systems harris discarding outliers nonlinear network joint conference neural networks pages seattle july huber robust statistics wiley sons koch marroquin yuille analog neuronal networks early vision proc natl acad marroquin poggio probabilistic solution illposed problems computational vision statistic assoc mead analog vlsi neural systems addisonwesley general theorems nonlinear systems resistance 
phil platt constraint methods neural networks computer graphics dept computer science technical report california technology pasadena poggio koch analog model computation illposed problems early vision technical report artificial intelligence laboratory cambridge memo poggio koch illposed problems early vision from computational theory analogue networks proc lond robust computational vision robust methods computer vision workshop sivilotti mahowald mead realtime visual computation analog cmos processing arrays stanford conference large scale cambridge press
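The segmentation network described above can be approximated in software by gradient descent on a convex energy. The following is a minimal sketch under several assumptions that are not taken from the paper: a quadratic data term with unit weight, a saturating resistor with current `lam * tanh(v / delta)` (whose co-content is `lam * delta * log cosh(v / delta)`, approaching the absolute-value penalty as `delta` shrinks), a one-dimensional dense input, and hypothetical parameter values chosen so the discrete-time iteration is stable.

```python
import numpy as np

def energy(u, d, lam, delta):
    """Data term plus the co-content of the tanh resistors (both convex)."""
    x = np.diff(u) / delta
    log_cosh = np.logaddexp(x, -x) - np.log(2.0)  # numerically stable log(cosh(x))
    return 0.5 * np.sum((u - d) ** 2) + lam * delta * np.sum(log_cosh)

def segment(d, lam=0.3, delta=0.05, step=0.02, n_steps=5000):
    """Relax node voltages u toward the unique minimum of the energy."""
    d = np.asarray(d, dtype=float)
    u = d.copy()
    for _ in range(n_steps):
        current = lam * np.tanh(np.diff(u) / delta)  # current in each resistor
        grad = u - d                                 # data term at each node
        grad[:-1] -= current                         # resistor to the right neighbour
        grad[1:] += current                          # resistor to the left neighbour
        u -= step * grad
    return u

# Synthetic noisy step, as in the simulations described in the text.
rng = np.random.default_rng(0)
clean = np.concatenate([np.zeros(50), np.ones(50)])
noisy = clean + rng.normal(0.0, 0.1, size=clean.size)
smoothed = segment(noisy)
```

Because the energy is convex, the result does not depend on the starting point, so no continuation or annealing schedule is needed; narrowing `delta` sharpens the segmentation at the cost of a smaller stable step size.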
10 refractoriness neural precision michael berry molecular cellular biology department harvard university cambridge abstract relationship neurons refractory period precision response identical stimuli investigated constructed model spiking neuron combines probabilistic firing refractory period realistic refractoriness model closely reproduced average firing rate response precision retinal ganglion cell model based free firing rate exists absence refractoriness function description spiking neurons response time histogram introduction response neurons repeated stimuli intrinsically noisy order variability account response spiking neuron instantaneous probability generating action potential response variability model determined poisson counting statistics variance spike count equal spike count time rieke recent experiments found greater precision vertebrate retina berry interneuron visual system ruyter cases neurons exhibited sharp transitions silence maximal firing neuron firing maximum rate refractoriness spikes regularly spaced poisson process firing rate asked refractory period play important role neurons response precision stimulus conditions firing events retinal ganglion cells addressed role refractoriness precision light responses retinal ganglion cells recording stimulation experiments performed salamander retina isolated solution action potentials retinal refractoriness neural precision ganglion cells recorded array spike times measured relative beginning stimulus repeat spatially uniform white light projected computer monitor photoreceptor layer intensity choosing random gaussian distribution standard deviation light level corresponded vision contrast defined temporal standard deviation light intensity divided recordings extended repeats segment random qualitative features ganglion cell responses random stimulation contrast spike trains extensive periods spikes repeated trials spike trains sparse silent periods covered large fraction total stimulus time 
periods firing time histogram psth rose maximum firing rate time scale comparable time interval spikes argued responses viewed discrete firing events continuously varying firing rate berry general firing events bursts spike firing events cell types similar results found rabbit retina berry time figure response salamander ganglion cell random stimulation stimulus intensity units segment spike trials firing rate firing event precision discrete episodes ganglion cell firing recognized psth contiguous period firing bounded periods complete silence provide consistent firing events boundaries firing event minima psth significantly lower neighboring maxima confidence berry boundaries defined spike trial assigned firing event berry measurements timing number precision obtained spike train firing events firing event accumulated distribution spike times trials calculated statistics average time spike event standard deviation trials quantified temporal jitter spike similarly average number spikes event variance trials quantified precision spike number trials contained spikes event contribution made included calculation ganglion cell shown temporal jitter spike event small repeated trials stimulus typically action potentials timing uncertainty milliseconds temporal jitter firing events single number taking median events variance spike count remarkably approached lower bound imposed fact individual trials necessarily produce integer spike counts events ganglion cell spike trains completely characterized firing rate berry spike number precision cell assessed computing average variance events dividing average spike count quantity factor poisson process refractoriness probabilistic models spike train start simplest probabilistic models spike train poisson model measured spike times estimate instantaneous rate spike generation time written formally number repeated stimulus trials heaviside function randomly generate sequence spike trains random numbers spike time spike time found 
numerically solving equation including absolute refractory period order refractoriness poisson expressed firing rate product free firing rate obtains neuron refractory recovery function describes neuron recovers refractoriness johnson miller recovery function spiking spiking affected modified rule selecting spikes absolute refractory period time weight function times refractoriness neural precision refractory period exclude spiking time probability firing spike prevented refractory period higher predicted free firing rate estimated excluding neuron unable fire refractoriness restricted spike times nearest time trial restriction assumption recovery function depends time action potential notice probability obeys inequality depends refractory period refractory period figure results model spike trains absolute refractory period firing rate averaged segment circles factor measure spike number precision event triangles temporal jitter plotted versus absolute refractory shown dotted panel real data definition free firing rate generate spike trains order statistics average firing rate range values refractory compare order statistics precision model spike trains real data free rate berry calculated segment response random salamander ganglion cell shown generate spike trains firing events identified model spike trains precision calculated finally procedure repeated times refractory period figure plots firing rate circles generated model averaged entire segment random error bars equal standard deviation rate repeated sets firing rate model matches actual firing rate real ganglion cell dashed refractory periods deviation larger refractory periods small large values absolute refractory period interspike intervals real data shorter case free firing rate enhanced match observed firing rate rate approximately constant refractory periods precision dramatically figure shows factor triangles expected refractory period drops largest refractory period temporal jitter decreases 
refractoriness added effect large precision spike number temporal precision fact probability rises spike occurs narrower range times number precision model matches real data timing precision matches probabilistic spike generator absolute refractory period match average firing rate precision retinal ganglion cells spike train roughly free parameter relative refractory period salamander ganglion cells typically relative refractory period absolute refractory period distribution inter spike intervals ganglion cell shown absolute refractory period relative refractoriness extends include effects relative refractoriness weight values figure illustrates method determining weight function refractoriness neuron constant firing rate interspike interval distribution drop exponentially behavior curve intervals range recovery function found interspike interval distribution berry notice recovery function rises linearly reaches unity weight function shown free firing rate calculated sets spike trains generated results summarized table give close agreement real data table results relative refractory period quantity real data model firing rate timing precision number precision neural precision poisson spike generator relative refractory period reproduces measured precision similar test performed population ganglion cells yielded close agreement berry interspike interval time figure determination relative refractory period inter spike interval distribution exponential curve solid resulting recovery function average rate model firing rate time similar figure compares firing rate real neuron generated model meansquared error counting noise estimated variance standard error divided variance agreement limited finite number repeated trials figure compares free firing rate observed rate firing equal beginning firing event larger spikes occurred addition generally smoother greater enhancement times peak summary free firing rate calculated spike train computational difficulty spiking neuron 
advantages conjunction refractory produces correct response precision saturate high firing rates continue distinguish neurons response prove constructing models inputoutput relationship spiking neuron berry acknowledgments mike deweese conversations acknowledges support national institute acknowledges support national science foundation berry neuron time figure illustration free firing rate observed firing rate real data solid compared model dotted free rate thick shown scale thin rates time bins references berry warland structure precision retinal spike trains pnas berry refractoriness neural precision neurosci press ruyter steveninck strong bialek reliability variability neural spike trains science johnson transmission signals fiber discharge patterns acoust signals retina acquisition analysis neurosci methods miller algorithms removing distortion discharge patterns acoust rieke warland ruyter steveninck bialek spikes exploring neural code cambridge press
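A minimal sketch of the spike generator discussed in the paper above, under stated assumptions: the spike probability in each time bin is taken to be the free firing rate times a recovery function times the bin width, and the recovery function is zero during the absolute refractory period and then rises linearly to unity over the relative refractory period. The function names, parameter names and numerical values are hypothetical and are not taken from the source.

```python
import random


def recovery(time_since_spike, abs_refrac, rel_refrac):
    """Recovery weight: 0 in the absolute refractory period, then a linear rise to 1."""
    if time_since_spike < abs_refrac:
        return 0.0
    if rel_refrac <= 0:
        return 1.0
    return min(1.0, (time_since_spike - abs_refrac) / rel_refrac)


def generate_spikes(free_rate, bin_width, abs_refrac, rel_refrac, rng):
    """Draw one spike train.

    free_rate: free firing rate (spikes per second) for each time bin.
    Returns the list of spike times in seconds.
    """
    spikes = []
    last_spike = None
    for i, rate in enumerate(free_rate):
        t = i * bin_width
        weight = 1.0 if last_spike is None else recovery(t - last_spike, abs_refrac, rel_refrac)
        # Probability of a spike in this bin, scaled down while the cell recovers.
        p_spike = min(1.0, rate * weight * bin_width)
        if rng.random() < p_spike:
            spikes.append(t)
            last_spike = t
    return spikes


if __name__ == "__main__":
    train = generate_spikes([200.0] * 20000, 0.0001, 0.002, 0.01, random.Random(0))
    print(len(train), "spikes in 2 s")
```

Setting both refractory periods to zero reduces the sketch to a plain Poisson generator, which is the comparison the text draws.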
9 probabilistic interpretation population codes richard zemel peter dayan abstract pouget present theoretical framework population codes generalizes naturally important case population information probability distribution underlying quantity single framework analyze existing models suggest evaluate model encoding probability distributions introduction population codes information represented activities units ubiquitous brain substantial work animals andor extract information underlying encoded quantity exception anderson work case extracting single quantity study ways characterizing joint activity population coding probability distribution underlying quantity examples motivate paper place cells hippocampus freely moving rats fire animal part environment cells area monkeys firing random moving stimulus treating activity populations cells single underlying variables inadequate insufficient information uncertain place place cells locations fire multiple values underlie input distribution moving random dots motion display capture computational power representing probability distribution underlying parameters university cambridge university washington work funded mcdonnellpew afosr probabilistic interpretation population codes paper provide general statistical framework population codes understand existing methods coding probability distributions generate method evaluate methods tasks population code interpretations starting point work neural population codes neurophys finding neurons respond variables underlying stimulus unimodal tuning function gaussian char cells sensory periphery cells report results complex processing including receiving information groups cells tuning properties instance zemel analysis distinguish spaces explicit space consists activities cells population typically implicit space underlying information population encodes tuned processing basis activities referred implicit space plays explicit role determining activities figure illustrates 
framework measured activities population cells operations encoding relationship activities cells underlying quantity world represented decoding information quantity extracted activities neurons generally noisy characterize encoding operations probabilistic simplest models make assumption conditional dependence units underlying quantity characterize degree correlation units coding model true bayesian decoding model specifies information carries characterized precisely prior distribution constant proportionality note starting deterministic quantity world encoding firing rates decoding operation results probability distribution uncertainty arises represented loss function extract single distribution operation attack common assumption single variable single position environment single coherent direction motion dots direction discrimination task capture experiments rats made uncertain position direction motion simultaneous motion directions natural characterization probability distribution variable extra information number dots represents information cast existing classes population codes terms framework poisson model poisson encoding model quantity encoded call activities individual units independent encode figure left encoding maps world tuning functions activities leading observed activities assume complete knowledge variables governing systematic activities cells single space underlying variables decoding extracts single picked distribution loss function terms activity number spikes cell fixed time interval stimulus onset typical form tuning gaussian preferred cell poisson decoding model constant respect simple poisson model makes assumption single argued characterization quantity world activities cells encode describe method encoding takes definition equation good gaussian square implying gaussian unimodal worse width distribution making practical cases close approximation delta function model anderson represent probability distributions single values 
activities represent distribution linear basis functions normalized probability distribution kernel functions tuning functions cells commonly measured experiment neural instantiation form part structure population code probability distributions positive range spatial frequencies reproduce severely limited terms framework model specifies method decoding makes encoding corollary evaluating requires choice encoding representing encode kullbackleibler divergence measure discrepancy expectationmaximization algorithm treating mixing proportions mixture model relies probability distributions projection method linear filtering based alternative distance computed projection tuning functions calculated overlap substantially extended poisson model model difficulty capturing probability distributions include high frequencies delta functions conversely standard poisson model pattern activities rapidly approaches delta function activities increase middle ground extend standard poisson encoding model recorded activities depend general poisson statistics equation identical model equation variability built poisson statistics decoding required bayesian inverse encoding note depends stochastically full bayesian inverse distribution distributions summarize approximation member perform approximate form maximum likelihood distributions approximate piecewise constant histogram takes piecewise constant histogram values generally maximum posteriori estimate shown derived maximizing variance smoothness prior form maximize likelihood adopt crude approximation averaging neighboring values operation extended poisson projection encode likelihood table summary operations respect framework interpretation methods compared operator ensure integer firing rates kernel functions method successive iterations comparison linear decoding method equation offers nonlinear combining activities give probability distribution underlying variable 
computational complexities equation irrelevant decoding implicit operation system perform comparing models illustrate models showing represent bimodal distributions kernel functions tuning functions extended poisson model units spaced evenly range table summarizes methods figure shows decoded version mixture broad gaussians figure shows mixture gaussians models work representing broad gaussians forms model difficulty gaussians version puts weight nearest kernel functions broad projection version rings attempt narrow components distributions extended poisson model greater fidelity discussion examined consequences seemingly obvious step instance uncertain places place cells representing places activated complications figure upper methods provide good bimodal gaussian distribution variance sufficiently large lower model difficulty structure interpretation instance longer maximum likelihood methods extract single code directly main resulting framework method encoding decoding probability distributions natural extension provably inadequate standard poisson model encoding decoding single values cells statistics determined integral probability distribution weighted tuning function cell suggested decoding model based approximation maximum likelihood decoding discretized version probability distribution showed recon broad narrow multimodal distributions accurately standard poisson model kernel density model built method units supposed poisson statistics robust noise decoding method biologically plausible quantitative lower bound activities code distribution stages processing subsequent population code extract single control behavior integrate information represented population codes form combined population code operations performed standard neural operations taking nonlinear weighted sums possibly products activities interested formation preserved operations measured standard decoding method modeling 
extraction requires modeling loss function empirical evidence motion experiment electrical stimulation cells input moving stimulus works remains integrating population codes generate output form population code hinton noted directly relates notion generalized hough transforms presently studying system learn perform combination decoder targets special concern combination understand noise instance visual system sensitive detecting outputs real cells stages system apparently noisy poisson statistics noise added stage processing combination final population code faithful input current research issue creation elimination noise cortical synapses neurons issue treated certainty magnitude idea total activity population code certainty existence quantity represent attractive provided independent knowing scale total scaling idea extended poisson models fact stage interpret greater activity representing information existence multiple objects multiple motions treatment place cell system plausible absolute level activity coding familiarity location entire collection cells thing representing single quantity representing probability distribution fidelity difficult provided interpretation encoding decoding clear suggest steps direction references anderson international journal modern physics anderson essen computational intelligence life york ieee press baldi biological cybernetics dempster laird rubin proceedings royal statistical society georgopoulos schwartz science hinton scientific american newsome movshon nature okeefe brain research abbott journal computational neuroscience newsome science seung sompolinsky proceedings national academy sciences neural computation zemel hinton neural computation part implementation
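A hedged sketch of the standard Poisson population code discussed in the paper above: a single value is encoded as independent Poisson spike counts whose means follow Gaussian tuning functions, and decoded by Bayesian inversion on a grid with a flat prior. The tuning width, peak rate, grid and all function names are assumptions made for illustration, not values from the source. It covers only the single-value case, not the extended Poisson model for full distributions.

```python
import math
import random


def tuning(x, preferred, width=0.3, peak=20.0):
    """Gaussian tuning function: mean spike count of a cell with the given preferred value."""
    return peak * math.exp(-0.5 * ((x - preferred) / width) ** 2)


def poisson_sample(mean, rng):
    """Knuth's multiplication method; adequate for the small means used here."""
    limit = math.exp(-mean)
    count, product = 0, rng.random()
    while product > limit:
        count += 1
        product *= rng.random()
    return count


def encode(x, centers, rng):
    """Independent Poisson spike counts, one per cell."""
    return [poisson_sample(tuning(x, c), rng) for c in centers]


def decode(counts, centers, grid):
    """Posterior over the grid under a flat prior: log P(x|r) = sum_i r_i log f_i(x) - f_i(x) + const."""
    log_post = []
    for x in grid:
        total = 0.0
        for r, c in zip(counts, centers):
            f = max(tuning(x, c), 1e-12)  # floor avoids log(0) far from the preferred value
            total += r * math.log(f) - f
        log_post.append(total)
    top = max(log_post)
    unnorm = [math.exp(v - top) for v in log_post]
    z = sum(unnorm)
    return [u / z for u in unnorm]


if __name__ == "__main__":
    centers = [-2 + 0.25 * i for i in range(17)]
    grid = [-2 + 0.01 * i for i in range(401)]
    posterior = decode(encode(0.5, centers, random.Random(0)), centers, grid)
    print("MAP estimate:", grid[posterior.index(max(posterior))])
```

As the spike counts grow, the decoded posterior sharpens towards a delta function, which is the limitation of the standard model that the text points out when the goal is to represent a whole distribution.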
11 global optimisation neural network models sequential sampling cambridge university engineering department cambridge england author cambridge university engineering department cambridge england cambridge university engineering department cambridge england andrew cambridge university engineering department cambridge england abstract propose strategy training neural networks resampling algorithms global optimisation strategy learn probability distribution network weights sequential framework suited applications involving online nonlinear nongaussian nonstationary signal processing introduction paper addresses sequential training neural networks powerful sampling techniques sequential techniques important applications neural networks involving signal processing data arrival inherently sequential adopt sequential training strategy deal signals information recent past information distant past sequentially estimate neural network models state space formulation extended filter freitas niranjan involves local output equation easily performed derivatives output respect unknown parameters approach employed authors including local leading algorithm gross simplification probability densities involved nonlinearity output model induces multimodality resulting distributions gaussian approximation densities important details approach adopt paper sampling discuss resampling sequential importance sampling algorithms particle filters gordon smith pitt train multilayer neural networks state space neural network modelling start state space representation model neural networks evolution time transition equation describes evolution network weights measurements equation describes nonlinear relation inputs outputs physical process denotes output measurements input measurements neural network weights measurements nonlinear mapping approximated multilayer perceptron measurements assumed corrupted noise sequential monte carlo 
framework probability distribution noise user examples choose gaussian distribution covariance measurement noise assumed uncorrelated network weights initial conditions model evolution network weights assuming depend previous stochastic component process noise represent uncertainty parameters evolve modelling errors unknown inputs assume process noise gaussian process covariance distributions adopted choice distributions network weights requires research process noise assumed uncorrelated network weights posterior density constitutes complete solution sequential estimation problem applications tracking interest estimate marginals filtering density computing density track complete history weights storage point view filtering density turns parsimonious full posterior density function filtering density network weights easily derive estimates network weights including modes confidence intervals sequential importance sampling sequential importance sampling optimisation framework represen samples describe posterior density function network parameters sample consists complete network parameters specifically make monte carlo approximation freitas niranjan represents samples describe posterior density denotes delta function expectations form approximated estimate samples drawn posterior density function typically draw samples directly posterior density draw samples proposal density function transform tion expectation variables importance ratios drawing samples proposal function approximate expectations interest estimate normalised importance ratios difficult show freitas niranjan assume hidden markov process initial density transition density recursive algorithms derived algorithms derive freitas niranjan shown perform neural network training extended algorithm deal multiple noise levels algorithm updating made software implementation algorithm html sampling stage predict dynamics equation sample update samples 
equations case evaluate importance ratios importance ratios resampling stage threshold index discrete kalman gain matrix denotes identity matrix size tuning parameters roles explained detail freitas niranjan represents matrix strictly speaking approximation covariance matrix network weights resampling stage eliminate samples probability multiply samples high probability authors efficient algorithms task operations pitt carpenter freitas niranjan assess ability hybrid algorithm estimate timevarying hidden parameters generated inputoutput data logistic function linear scaling displacement shown figure simple model equivalent hidden neuron output linear neuron applied gaussian input sequences model corrupted weights output values gaussian noise trained model structure inputoutput figure logistic function linear scaling displacement weights chosen data generated model chose sampling trajectories initial weights variance process noise parameter levels shown plot figure time training samples figure noise level estimation algorithm phase time steps allowed model weights vary time phase algorithm track inputoutput training data estimate latent model weights addition assumed noise variance levels training session time step fixed values weights generated inputoutput data test sets original model input test data trained model weights values estimated time step subsequently output prediction trained model compared output data original model assess generalisation performance training process shown figure noise level trajectories converged true addition track network weights obtain accurate output predictions shown figures figure step ahead predictions training phase left stationary predictions test phase time figure weights tracking performance algorithm histograms algorithm performs global search parameter space freitas niranjan conclusions paper presented sequential monte carlo 
approach training neural networks bayesian setting proposed algorithm makes gradient sampling information interpreted gaussian mixture filter sampling trajectories employed number trajectories increases computational requirements increase linearly method suitable sampling strategy approximating multimodal distributions research include design algorithms adapting noise covariances studying effect noise models network weights improving computational efficiency algorithms freitas supported university merit foundation research development south award college external cambridge references carpenter improved particle filter nonlinear problems technical report department statistics oxford university england freitas niranjan bayesian kalman models regularisation sequential learning technical report cambridge university freitas niranjan regularisation sequential learning algorithms jordan kearns solla advances neural information systems press freitas niranjan sequential monte carlo methods optimisation neural network models technical report cambridge university sequential methods bayesian filtering technical report cambridge university avail gordon smith approach bayesian state estimation pitt filtering simulation auxiliary particle filters technical report department statistics college london england training multilayer perceptrons extended kalman algorithm touretzky advances neural information systems mateo
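A minimal sketch of the sampling, importance-ratio and resampling stages described in the paper above, under stated assumptions. This is a plain bootstrap filter (sampling importance resampling) tracking a single weight of a logistic unit; it is not the hybrid gradient-based algorithm of the paper, and it omits the Kalman update of each sample. The model y = logistic(w x) + noise, the noise levels, the particle count and all names are hypothetical choices made for illustration.

```python
import math
import random


def logistic(a):
    return 1.0 / (1.0 + math.exp(-a))


def sir_step(particles, x, y, process_std, noise_std, rng):
    """One sequential step: predict, weight by the likelihood, resample."""
    # Sampling stage: random-walk transition for the weight.
    predicted = [w + rng.gauss(0.0, process_std) for w in particles]
    # Importance ratios: Gaussian likelihood of the new measurement.
    ratios = [math.exp(-0.5 * ((y - logistic(w * x)) / noise_std) ** 2) for w in predicted]
    total = sum(ratios)
    if total == 0.0:
        return predicted  # every ratio underflowed; keep the predicted samples
    normalised = [r / total for r in ratios]
    # Resampling stage: multiply likely samples, eliminate unlikely ones.
    return rng.choices(predicted, weights=normalised, k=len(predicted))


def track_weight(inputs, outputs, n_particles=500, process_std=0.02, noise_std=0.05, seed=0):
    """Return the posterior-mean estimate of the weight after each measurement."""
    rng = random.Random(seed)
    particles = [rng.gauss(0.0, 2.0) for _ in range(n_particles)]
    estimates = []
    for x, y in zip(inputs, outputs):
        particles = sir_step(particles, x, y, process_std, noise_std, rng)
        estimates.append(sum(particles) / len(particles))
    return estimates


if __name__ == "__main__":
    data_rng = random.Random(1)
    xs = [data_rng.gauss(0.0, 1.0) for _ in range(300)]
    ys = [logistic(1.5 * x) + data_rng.gauss(0.0, 0.05) for x in xs]
    print("final estimate:", track_weight(xs, ys)[-1])
```

Because each sample is a complete set of parameters, the same loop extends to a full weight vector by replacing the scalar `w` with a list. The cost per step grows linearly with the number of samples, in line with the remark in the conclusions.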
9 statistical mechanics experts mixture department physics university science technology email abstract study generalization capability mixture experts learn examples generated network architecture number examples smaller critical network shows symmetric phase role experts specialized crossing critical point system undergoes continuous phase transition breaking phase gating network partitions input space effectively expert assigned space find mixture experts multiple level hierarchy shows multiple phase transitions introduction recently considerable interest neural network community techniques integrate collective predictions mixture experts implements model applications efforts evaluate generalization capability modular approaches theoretically present analytic study generalization mixture experts statistical physics perspective statistical mechanics formulation focused study feedforward neural network architectures close multilayer expect statistical mechanics approach effectively evaluate advanced architectures including mixture models letter study generalization mixture variety network trained examples teacher network architecture find interesting phase transition driven symmetry breaking experts phase transition closely related mechanism mixture model originally designed accomplish statistical mechanics formulation mixture experts mixture tree consisted expert networks gating networks assign weights outputs experts expert networks leaves tree gating networks branching points tree sake simplicity network gating network experts expert produces generalized linear function dimensional input weight vector expert spherical transfer function produces binary outputs principle implemented assigning expert space input space local rules gating network makes partitions input space assigns expert weighting factor gating function heaviside step function experts gating function defines sharp boundary subspace perpendicular vector softmax function original literature yield soft boundary 
weighted output mixture expert written network individual experts generates binary outputs learn dichotomy rules training examples generated teacher architecture weights gating network expert teacher learning mixture experts interpreted probabilistically learning algorithm considered maximum likelihood estimation learning algorithms originated statistical methods algorithm gibbs algorithm noise level leads gibbs distribution weights long time partition function training experts gating network good generalization performance energy system defined errors examples performance network measured generalization function represents average input space generalization error defined denotes average examples denotes thermal average probability distribution replica calculation turns intractable annealed approximation annealed approximation exact high temperature limit approximation qualitatively good results case learning realizable generalization curve phase transition generalization function written function overlaps weight vectors teacher student overlap order parameters probability expert student learns examples generated expert teacher volume fraction input space positive examples expert student wrong answer probability respect expert teacher assume weight vectors teacher orthogonal overlap order parameters shown vanish symmetry properties network free energy written function order thermodynamic limit dimension input space number examples infinity keeping ratio finite minimizing free energy respect order parameters find probable values order parameters generalization error plots overlap order parameters versus temperature examining plot find interesting phase transition driven symmetry breaking experts phase transition point overlap gating networks teacher student overlaps experts symmetric symmetric phase gating network examples learn proper partitioning performance random partitioning expert student specialize subspaces local rule 
expert teacher expert learn multiple linear rules linear structure leads poor generalization performance critical amount examples provided strategy work crossing critical point system undergoes continuous phase transition symmetry breaking phase order parameter related goodness partition begins increase approaches increasing gating network partition close teacher plot order overlap experts teacher student branches approaches means expert role making pair expert teacher plots generalization curve versus scale generalization curve continuous slope curve transition point generalization curve figure overlap order parameters versus find solid line axis dashed line transition point begins increase dotted line dashed line branches approach generalization curve versus mixture experts scale transition point shown figure typical generalization error curve network continuous weight asymptotic behavior large decay observed learning feedforward networks mixture experts hierarchy study generalization hierarchical mixture experts hierarchical mixture experts consisted gating networks experts level tree divided branch turn divided branches lower level experts leaves tree gating networks lowerlevel branching points network learns training examples drawn teacher network architecture shows learning curve related phase transitions system fully symmetric phase gating networks provide correct partition experts levels hierarchy experts specialize overlaps weights teacher experts phase transition smaller related symmetry breaking gating network gating network partition input space parts lowerlevel gating network functioning properly overlap gating networks lower level tree teacher experts partially specialize groups specialization group accomplished overlap order parameter distinct values bigger overlap experts teacher group smaller experts teacher belong group transition point symmetry related lowerlevel 
hierarchy breaks networks work properly input space divided expert makes pair expert teacher overlap order parameters distinct values largest overlap matching expert teacher largest overlap neighboring teacher expert tree hierarchy smallest experts group phase transition result learning curve conclusion phase transition mixture experts interpreted symmetry breaking phenomenon similar observed committee machine transition continuous means symmetry breaking easier mixture experts multilayer perceptron advantage learning highly nonlinear rules existence local minima find hierarchical mixture experts multiple phase transitions related symmetry breaking levels note symmetry breaking higherlevel branch desirable property model jordan saul sompolinsky seung discussions comments work partially supported basic science special program basic science research institute references jacobs jordan hinton neural computation jordan jacobs neural computation perrone cooper neural networks speech image processing london wolpert neural networks seung sompolinsky tishby phys park phys park phys baum haussler neural computation
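A hedged sketch of the two-expert architecture analysed in the paper above: a gating vector with a Heaviside gate picks one of two experts, each expert outputs the sign of a linear function of the input, and the generalization error is estimated as the fraction of random inputs on which student and teacher disagree. The Gaussian input distribution, the Monte Carlo estimate (standing in for the analytic generalization function) and all names are assumptions made for illustration.

```python
import random


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def sign(a):
    return 1 if a >= 0 else -1


def mixture_output(x, gate, experts):
    """Hard gating: the sign of gate.x selects the expert; the expert gives a binary output."""
    chosen = experts[0] if dot(gate, x) >= 0 else experts[1]
    return sign(dot(chosen, x))


def generalization_error(student, teacher, dim, n_samples, rng):
    """Monte Carlo estimate of the probability that student and teacher disagree.

    student and teacher are (gate, [expert_0, expert_1]) pairs of weight vectors.
    """
    errors = 0
    for _ in range(n_samples):
        x = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        if mixture_output(x, *student) != mixture_output(x, *teacher):
            errors += 1
    return errors / n_samples


if __name__ == "__main__":
    # Teacher with mutually orthogonal gate and expert vectors, as assumed in the text.
    teacher = ([0, 0, 1, 0], [[1, 0, 0, 0], [0, 1, 0, 0]])
    swapped = ([0, 0, 1, 0], [[0, 1, 0, 0], [1, 0, 0, 0]])
    print(generalization_error(swapped, teacher, 4, 2000, random.Random(0)))
```

A student whose experts are assigned to the wrong halves of the input space performs at chance against an orthogonal teacher, which gives a concrete picture of why correct partitioning by the gate matters for specialization.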
8 sound segmentation smith computer science university abstract technique segmenting sounds processing based early auditory processing presented technique based features sound neuron spike recording suggests detected cochlear nucleus sound signal band passed signal processed enhance onsets offsets onset offset signals compressed clustered time frequency channels network integrate neurons onsets offsets spikes timing spikes segment sound background traditional speech interpretation techniques based fourier transforms spectrum hidden markov model neural network interpretation stage limitations continuous speech interpreting speech presence noise interest front ends modelling biological auditory systems speech interpretation systems meyer cole auditory modelling systems similar early auditory processing biological systems mammalian auditory processing ears incoming signal filtered external canal membrane vibration passed middle window cochlea inside cochlea pressure wave pattern vibration occur basilar membrane appears active process outer hair cells organ movement detected hair cells turned neural impulses neurons spiral ganglion pass auditory nerve arrive parts cochlear nucleus nerve areas lateral medial nuclei superior inferior colliculus virtually modern sound speech interpretation systems form band pass filtering biology cochlea fourier transforms perform calculation energy band time period cochlea auditory front ends differ extent length follow animal early auditory processing term generally implies filters high temporal resolution maintained initial stages means filtering techniques fourier transforms bandpass stage filtering systems implemented directly silicon lazzaro mead lazzaro schaik auditory models moved cochlear filtering hair cell modelled simple rectification smith based work brown lazzaro experimented silicon version autocorrelation processing lazzaro mead meyer brown smith considered early brainstem nuclei contribution based neurophysiology 
cell types auditory modelbased systems find speech recognition systems work presented auditory modelling onset cells cochlear nucleus adds temporal neural network clean segmentation produced part smith system biological plausibility effective datadriven segmentation technique implement silicon techniques digitized sound applied auditory front sound channels bandwidth centre frequency band moore rectified modelling effect hair cells signals produced bear resemblance auditory nerve real system channels nerve channel carries information coding models signal population neighboring auditory nerve filter signal present auditory nerve stronger onset tone effect pronounced cell types cochlear nucleus fire strongly onset sound band sensitive silent emphasis onsets modelled convolving signal band filter computes averages recent recent recent recent biologically justification neuron receiving driving input excitatory input shorter inhibitory input exponentially weighted averages averages formed filter smith place emphasis recent part signal making effective sound segmentation filter output input signal determine rise fall times pulses system sensitive gaussians convolving filter positive peak crosses negative values system sensitive energy rises falls occur sounds positive signal implies signal increasing intensity negative signal implies decreasing intensity convolution sound analog difference gaussians operator extract edges images marr smith performed segmentation directly signal compressing signal signal divided signals onset signal consisting part offset signal consisting inverted part compressed logarithmically increases dynamical system models biological effects compressed onset signal models output population onset cells technique producing onset signal related integrateandfire neural network segment sound onset offset signals integrated frequency bands time temporal clustering achieved network integrateandfire units integrateandfire unit weighted input time activity 
initially input neuron dissipation describes integration reaches threshold unit fires pulse reset firing period input called refractory period neurons discussed integrateandfire neuron neuron received input ther single adjacent channels equal positive weighting output neuron back adjacent neurons fixed positive weight time step leaky nature accumulation activity excitatory input neuron arriving activation threshold effect firing time excitatory input arriving activation lower similar input applied neurons adjacent channels effect interneuron connections fires neighbors fire immediately network neurons cluster onset offset signals producing sharp burst spikes number channels providing unambiguous onsets offsets external internal weights network adjusted onset offset input allowed neurons fire internal input smith firing refractory period onset system offset system onset system effect produce sharp onset firing responses adjacent channels response sudden increase energy channels grouping onsets temporally onsets generally marked output stage call onset offsets tend gradual physical effects sound start element starts move slowly vibration discussion vibration stop sound slowly echoes reliably mark offset sound reduce refractory period offset neurons produce train pulses duration offset call output stage offset results technique datadriven applied sound source applied speech musical sounds figure shows effect applying techniques discussed short piece speech shows neural network integrates onset channels allowing onsets segmentation simplest technique divide continuous speech onset ensure occasional onset single channel system onsets occur result short segments segmentation boundary onsets inside period minimum segment length utterance neural information processing systems phonetic representation segmented segments text spoken slowly phonetic representation segmenting technique segments phonemes broken segments system tive segmentation insensitive speech rate system 
effective finding speech inside types noise noise system segment sound single musical instruments clear breaks notes straightforward smith correct segmentation achieved directly signal achieved sounds notes change smoothly visible figure onsets clear network segmentation produced sound segmentation figure offset maps author neural information cessing systems rapidly envelope original sound onset channels onset filter parameters text neuron channel interconnection neuron refractory period network input applied adjacent channels internal feedback chan offset produced similarly refractory period envelope nice noise background lines mark utterance onset offset maps perfect results obtained input network spread channels conclusions work effective data driven segmentation technique based onset feature detection integrateandfire neurons demonstrated broadband noise segmentation effectiveness technique depend application smith figure sound vertical lines showing boundary notes onsets found single neuron channel interconnection internal feedback channel adjacent channels offsets found refractory period segmentation information bands onsets propose extend work combining segmentation work bands sharing amplitude modulation extract sound segments subset bands allowing segmentation streaming concurrently acknowledgements members centre cognitive computational university references meyer speech analysis means model cochlear nerve nucleus visual representations speech signals cooke modelling intermediate auditory system mathematics applied biology medicine publishing canada classification unit types cochlear nucleus histograms regularity neurophysiology sound segmentation meyer model processing voiced auditory nerve cochlear nucleus proceedings inst acoustics brown computational auditory scene analysis department computing science university england cole challenge spoken language systems research directions ieee trans speech audio processing auditory models speech technology 
intelligent perceptual models springer verlag schaik linear predictive coding speech signal analog cochlear model internal report center systems switzerland world approach auditory event perception psychology chang responses neurons nerve cats pure tones analysis hearing research lazzaro mead silicon modelling pitch perception natl acad sciences lazzaro wawrzynek mahowald sivilotti silicon auditory processors computer ieee trans neural networks theory pitch perception andreou goldstein analog cochlear model multiresolution speech analysis advances neural information processing systems hanson cowan giles morgan kaufmann marr theory edge detection proc royal society london simulation transduction studies acoust moore suggested formulae calculating bandwidths excitation patterns acoust america synchronization biological oscillators siam appl math introduction auditory processing introduction physiology hearing edition academic press efficient implementation auditory filter bank technical report computer smith sound segmentation onsets offsets music research smith coding interpretation segmentation sound march schwartz theoretical study neural mechanisms specialized detection events proc paris
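The integrate-and-fire unit described in the sound segmentation paper above (leaky accumulation of input, a firing threshold, a reset, and a refractory period during which input is ignored) can be sketched in discrete time. This is a minimal illustration only: the function name, the leak factor, the threshold and the refractory length are assumptions chosen for the example, not values from the paper, and the sketch omits the connections between adjacent channels.

```python
def simulate_lif(inputs, leak=0.9, threshold=1.0, refractory_steps=2):
    """Simulate one leaky integrate-and-fire neuron over discrete time steps.

    inputs: sequence of input values, one per time step.
    Returns a list of 0/1 values, 1 where the neuron fired.
    All parameter values are illustrative assumptions.
    """
    activity = 0.0
    refractory_left = 0
    spikes = []
    for x in inputs:
        if refractory_left > 0:
            # Input arriving during the refractory period is ignored.
            refractory_left -= 1
            spikes.append(0)
            continue
        # Leaky accumulation: old activity decays, new input is added.
        activity = leak * activity + x
        if activity >= threshold:
            # Fire a pulse, reset, and enter the refractory period.
            spikes.append(1)
            activity = 0.0
            refractory_left = refractory_steps
        else:
            spikes.append(0)
    return spikes


if __name__ == "__main__":
    # A sustained input fires repeatedly; a weak input leaks away and never fires.
    print(simulate_lif([0.6] * 6))
    print(simulate_lif([0.05] * 50))
```

With these assumed settings a shorter refractory period lets a sustained input produce a train of pulses, which is the behaviour the text attributes to the offset neurons.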
2 recognizing handprinted letters digits martin james pittman austin texas abstract developing handprinted character recognition system multilayered neural trained backpropagation report results training nets samples handprinted digits scanned bank checks handprinted letters entered computer large training sufficient capacity achieve high performance training nets typically achieved error rates reject rate reject rate topology capacity system measured number connections surprisingly effect generalization developing practical pattern recognition systems results suggest large representative training sample single important factor achieving high recognition accuracy standpoint raise relevance backpropagation learning models estimate likelihood high generalization estimates capacity reducing capacity benefits accomplished local receptive fields shared weights case find evolves feature detectors resembling visual cortex linskers orientationselective nodes practical interest handprinted character recognition current tech trends systems interpret ments computers replace enables users write draw directly flat panel display paper report results applying multilayered neural nets trained backpropagation rumelhart hinton williams cases developing pattern recognition systems typically twostage process intuition experimentation select features represent input pattern variety techniques optimize system assumes representation applications backpropagation learning character recognition learning capabilities classifier system burr denker gardner graf howard hubbard jackel baird guyon mori backpropagation learning optimize feature selection pattern classification simultaneously avoid predetermined features input favor segmented grayscale array character step goal approximating input projected human retina input required report results handprinted digits letters handprinted digits handprinted digits scanned amount region realworld
bank checks grayscale array test consists samples training sets varied samples difficult compare recognition rates arising pattern sets difficulty tion gained human performance data benchmark independent person test digits achieved error rate figure considerably performance operators numbers directly bank checks segmentation gorithm working letters digits enables tests generality results pattern double number output categories handprinted letters letters collected people writing input device flat panel display sequence coordinates points spatial resolution points temporal sequence character converted size normalized array keeping aspect ratio constant found recogni tion accuracy significantly improved blurred convolution gaussian pattern represented grayscale image test samples extracted selecting samples people training sets generated people generating test training sizes ranged roughly samples high recognition accuracy find high recognition accuracy pattern sets table reports minimal error rates achieved test samples pattern sets reject rates case handprinted digits error rate effects number training samples network capacity topology reported section nets trained error rates training began learning rate momentum learning rate decreased training accuracy began oscillate stabilized large number training epochs evaluate output vector basis consistently improves accuracy results network parameters smaller effect performance recognizing handprinted letters digits proaches errors made human judge suggests generalization require improving segmentation accuracy fact error rate achieved letters promising accuracy fairly high table error rates nets trained largest sample sets tested samples reject rate digits letters large number categories error rate applications contextual constraints significantly boost accuracy wordlevel minimal network capacity topology effects effects network parameters generalization practical scientific significance practical pattern recognition 
systems interested effects determine limited resources spent opti network parameters collecting larger representative training effects capacity bear relevance learning models back propagation central premise general models size initial search capacity number train samples needed achieve high generalization performance learning search function maps inputs correct outputs learning occurs comparing successive samples inputoutput pairs functions search space functions inconsistent training samples rejected large training sets narrow search function closely approximates function yields high generalization capacity learning number functions generalization larger initial search space requires training samples narrow search sufficiently suggests improve generalization capacity minimized typically unclear minimize capacity eliminating desired function search space heuristic suggested simple receives support experience curve fitting loworder polynomials ically extrapolate interpolate highorder polynomials duda hart extensions heuristic neural learning propose reducing capacity number connections number bits represent connection martin pittman weight baum haussler denker schwartz solla howard jackel hopfield manipulated capacity nets number ways varying number hidden nodes limiting connectivity layers nodes input local areas sharing connection weights hidden nodes found effects generalization number hidden nodes figure presents generalization results function training size nets hidden layer varying numbers hidden nodes number free parameters number connections biases case presented parentheses spite considerable variation number free parameters nets fewer hidden nodes improve generalization baum haussler estimate number training samples required achieve error rate generalization test error rate achieved training assume feedforward hidden layer connections estimates sense calculations assume arbitrary function number training samples order refers number nodes certainty achieve 
generalization rates estimate number training samples needed provide lower digits number hidden nodes letters number hidden nodes training size training size figure effect number hidden nodes training size generalization recognizing handprinted letters digits bound estimate order fewer number samples functions fail achieve generalization rates fact find advantage reducing number connections baum estimates underlying assumption capacity plays strong role generalization baum haussler suggest constant proportionality esti implies achieving error rates samples requires times training examples connection weights largest nets implies requirement roughly million training samples regard prohibitively large found samples sufficient sufficiently large training sample imply large sample character recognition find sample sizes order thousands tens thousands yield performance close human reason discrepancy baum estimates distribu sense reflect worstcase scenarios func tions learn functions underlying natural pattern recognition tasks representative functions sults raise relevance natural pattern recognition learning models based worstcase analyses content greatly impact generalization local connectivity shared weights biologically plausible reduce capacity limit connectivity layers local areas shared weights visual cortex neurons responsive feature oriented line appearing small local region retina hubel wiesel oriented essentially replicated visual field feature detected appears sense connections feeding oriented line detector shared similar areas visual field neural local structure achieved limiting connectivity hidden node receives input local areas input hidden layer preceding weight sharing achieved linking incoming weights hidden nodes weights leading nodes randomly initial ized values forced equivalent updates learning evolves local feature invariant input array exist indicating local connectivity shared weights prove generalization performance tasks position invariance 
required rumelhart hinton williams examined benefits local receptive fields shared weights hand printed character recognition position invariance required minimize importance position invariance underlying reliable pattern recognition explicitly dont bias discovering testing role local receptive fields shared weights martin pittman situations position invariance required relevant discovering constraints role position invariance figure find slightly improved generalization moving nets global connectivity layers nets local receptive fields nets local receptive fields shared weights true fact number free parameters substantially reduced positive effects occur small training sizes explain reported greater degree improved generalization local receptive fields data reported networks hidden layers global nets nodes layer nodes layer nodes received input local overlapping regions offset pixels input array hidden layer nodes output layer nodes global receptive fields local shared nets nodes hidden layer shared weights hidden layer digits ters nodes local overlapping shared receptive fields size digits global local local shared letters training size training size local local shared figure effects capacity topology generalization hidden layer experimented large variety architectures sort varying number hidden nodes sizes overlap local receptive fields local receptive fields shared weights hidden layers fact found difference generalization pattern sets variations network architectures generality results recognizing handprinted letters digits discussion architecture enables high training performance find small effects network capacity topology generalization performance large training yields high recognition accuracy robust architectures worked results suggest practical advice developing handprinted character recognition systems optimizing general ization performance goal limited resources large representative training extensive experimentation architectures variations capacity 
topology examined substantially affect generalization performance sufficiently large training sets sufficiently large interpreted order thousand tens thousands samples handprinted character recognition theoretical standpoint negligible effects network capacity generalization performance central premise machine learning size initial hypothesis space determines learning performance challenges backpropagation learning statistical models estimate likelihood high generalization performance estimates capacity gradient ment nature backpropagation learning functions represented visited learning negligible effects capacity suggest number functions visited learning constitutes small percentage total functions represented number reasons capacity impact generalization performance circumstances regularly error rates helps avoid possibility overfitting data tion trained higher levels long large training sets number connections good measure capacity amount information passed connection measure number connections conference denker solla howard jackel presented evidence removing unimportant weights network reduce capacity fact generalization rates close human accuracy levels nets extremely large numbers free parameters suggests general effects capacity topology small comparison effects training size dont topologies push performance human accuracy levels biasing discovering range underlie human pattern recognition problem explicitly position size rotation bias discovering full range advantages reducing capacity reducing gross indicators capacity significantly improve generalization good practical scientific reasons good reason reduce number connections speed processing local receptive fields shared weights biases position invariance simpler modular internal representation replicated large retina important implications developing nets combine character segmentation recognition local receptive fields shared weights offers promise increasing understanding correctly patterns
number receptive fields greatly reduced figure depicts hinton diagrams local digits letters figure receptive fields evolved hidden layer nodes nets local receptive fields shared weights fields hidden layer nodes nets shared weights trained digits letters large gray rectangles corresponds receptive field hidden node left trained digits trained letters black rectangles correspond negative weights white positive weights size black white rectangles reflects magnitude weights local feature detectors develop pattern sets oriented line edge detectors similar oriented line edge detectors found visual cortex hubel wiesel linskers orientationselective nodes emerge exposed random patterns linskers case feature detectors develop emergent property principle signal transformation occurring layer maximize information output signals convey input signals fact similar feature detectors emerge backpropagation nets trained natural patterns explicit constraints maximize information flow between layers backpropagation nets categorization typically viewed abstraction process involving considerable loss information references baum haussler size valid generalization touretzky advances neural information processing systems morgan kaufmann burr neural network digit recognizer proceedings international conference systems cybernetics denker gardner graf henderson howard hubbard jackel baird guyon neural network recognizer handwritten code digits touretzky advances neural information processing systems morgan kaufmann denker schwartz solla howard jackel hopfield large automatic learning rule extraction generalization complex systems duda hart pattern classification scene analysis john wiley sons experimental results generation local receptive fields global convergence improve perceptual learning connectionist networks computer science department university hubel wiesel brain mechanisms vision scientific american generalization network design strategies technical report
department computer science university toronto linsker basic network principles neural architecture emergence orientationselective cells proceedings national academy sciences linsker organizing principle layered perceptual network anderson neural information processing systems american institute physics mori neural networks learn discriminate similar kanji characters touretzky advances neural information processing systems morgan kaufmann rumelhart hinton williams learning internal representations error propagation rumelhart mcclelland editors parallel distributed processing cambridge mass press comparison nearest neighbor classifier neural network character recognition ieee international conference neural networks washington acknowledgements corporation handprinted digits bauer invaluable handprinted letters
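The handprinted character paper above measures capacity by the number of free parameters (connections plus biases) and compares globally connected nets, nets with local receptive fields, and nets with local receptive fields and shared weights. The arithmetic behind that comparison can be sketched as below. The layer sizes in the example are hypothetical, since the figures were lost from the text, and the sample-size rule of roughly W/eps (W weights, target error rate eps) is the commonly quoted Baum and Haussler rule of thumb, used here as an assumption rather than as the paper's exact calculation.

```python
def params_global(n_inputs, n_hidden):
    """Fully connected layer: every hidden node sees every input, plus one bias each."""
    return n_hidden * (n_inputs + 1)


def params_local(n_hidden, field_size):
    """Local receptive fields: each hidden node sees only field_size inputs, plus a bias."""
    return n_hidden * (field_size + 1)


def params_shared(n_weight_sets, field_size):
    """Shared weights: nodes in a group reuse one set of field weights and one bias,
    so only the number of distinct weight sets counts (bias sharing is an assumption)."""
    return n_weight_sets * (field_size + 1)


def rule_of_thumb_samples(n_weights, error_rate):
    """Order-of-magnitude training set size W / eps (assumed rule of thumb)."""
    return n_weights / error_rate


if __name__ == "__main__":
    # Hypothetical sizes: a 15x15 input array, 10 hidden nodes, 5x5 receptive fields.
    print(params_global(225, 10))   # all-to-all connectivity
    print(params_local(10, 25))     # local fields, separate weights
    print(params_shared(2, 25))     # two shared feature detectors
    print(rule_of_thumb_samples(params_global(225, 10), 0.25))
```

The point of the sketch is only that local fields and weight sharing cut the parameter count by large factors, which is why the paper's finding of small generalization differences is notable.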
3 stochastic complexity admissible models neural network classifiers smyth communications systems research propulsion laboratory california institute technology pasadena abstract training data choose network classifier family networks complexities paper discuss application stochastic complexity theory classifier design problems provide insights problem introduce notion admissible models complexity models consideration affected factors class entropy amount training data prior belief discuss implications results respect neural architectures demonstrate approach real data medical diagnosis task introduction motivation paper examine general sense application minimum description length techniques problem selecting good classifier large candidate models hypotheses pattern recognition algorithms differ conventional statistical modeling techniques sense typically choose large number candidate models describe data problem searching candidate models frequently approached practice greedy algorithms context techniques eliminate portions hypothesis space considerable interest show paper intrinsic structure formalism eliminate large numbers candidate models minimal information data results depend stochastic complexity simple notion models complex problem models complexity exceeds data discarded consideration search parsimonious model background stochastic complexity theory general principles stochastic complexity general theory inductive inference data unlike traditional inference techniques takes account complexity proposed model addition standard model data detailed rationale reader referred work rissanen freeman references note minimum description length technique approach implicitly related maximum bayesian estimation techniques cast framework minimum description length stochastic complexity notation barron cover datapoints described sequence observations referred short correspond values random variables continuous discrete purposes paper elements finite alphabet discrete m-ary class
variable family candidate models consideration note defining function number data points possibility complicated models data arrives nonnegative numbers interpreted cost bits model turn prior probability assigned model suitably normalized refer coding scheme total description length data model defined describe model class data relative model function feature data stochastic complexity data relative minimum description length problem finding model shortest description length intractable general case nonetheless idea finding model motivated works practice preferable alternative approach ignoring complexity issue smyth admissible stochastic complexity models definition find define notion admissible model classification problem admissible models defined models complexity exists model description length smaller words models complexity bits greater description length model terms description length eliminated consideration defined dynamically function description lengths calculated search typically predefined class feedforward neural networks activation functions restrict search good model models practical practice difficult determine exact boundaries large decision trees neural networks note notion seek minimal description length equivalently model posterjori probability situations goal average number models bayesian manner modification criterion results admissible models simple techniques eliminating obvious models interest classification problem condition model admissible entropy mary class variable obvious interpretation words admissible model complexity data easy show addition complexity admissible model upper bounded parameters classification problem size space admissible models bounded approach suggests classification number classes strict limitations admissible models theory state larger subset necessarily result optimal model found difficult argue case including large numbers models complex problem approach lead inefficient search worst poor model chosen result 
poor coding scheme large hypothesis space stochastic complexity admissible models bayes risk notion minimal compression minimum achievable related classification problem minimal bayes risk problem model necessarily unique achieves optimal bayes risk minimizes classifier error classification problem necessarily practical problems interest nonzero ambiguity mapping feature space class variable addition defined admissible limit fundamental error representation family models consideration flexible optimally represent mapping smyth shown information bayes error rate problem bounds applying minimum description length principles neural network design principle results applied variety classifier design problems applications markov model selection decision tree design smyth paper limit attention problem automatically selecting feedforward multilayer network architecture calculation clear preceding discussion application principle classifier selection requires classifier produce posterior probability estimate class labels context network model problem provided network trained provide estimates requires simple modification objective function loglikelihood function class label training datum networks estimate function proposed literature past crossentropy measure special case binary classes recently derived basic arguments minimum mutual information bridle maximum likelihood estimation crossentropy function network training component description length criterion equivalent case special cases procedure complexity term constant left optimization models assumed equally likelihood decision criterion complexity penalization multilayer perceptron models proposed past barron penalty term number parameters weights biases network complexity measure general arguments originally proposed rissanen penalty term large cybenko smyth pointed existing successful applications networks parameters possibly justified statistical analysis amount training data construct network critical factor lies
precision parameters stated final model essence principle bayesian techniques data justifies parameter model finite precision inversely proportional inherent variance estimate approximate techniques calculation complexity terms manner proposed weigend huberman rumelhart volume complete description length analysis appeared literature complexity discrete network model turns alternatives multilayer perceptrons complexity easier calculate rulebased network goodman model hidden units correspond boolean combinations discrete input variables link weights hidden output class nodes proportional conditional probabilities class activation hidden node output nodes form estimates posterior class probabilities simple summation normalization implicit assumption conditional independence practice fact hidden units chosen manner ensure assumption violated complexity penalty network calculated link hidden output layers coding term tion hidden units description length network hidden units order hidden node prior probability orders definition description length earlier results admissible models number hidden units architecture upper bounded number binary input attributes application medical diagnosis problem application techniques discovery parsimonious network breast cancer diagnosis discrete network model common technique breast cancer diagnosis obtain fine patient sample evaluated makes diagnosis ground truth form binary class labels benign malignant obtained stage mangasarian collection database information stochastic complexity feature information consisted subjective evaluations sample characteristics uniformity cell size marginal training data consists samples obtained real patients assigned class labels prior class entropy immediately state bounds networks hidden units evaluate models narrow region results stated earlier figure graphical interpretation procedure region description length bits figure region function description length algorithm effectively moves lefthand axis 
adding hidden units greedy manner initially description length lower curve decreases rapidly capture gross structure data model calculate description length turn calculate upper bound upper curve bound linear description length time hidden units models hidden units finally local minimum description length function reached units point optimal solution hidden units matter interest resulting network hidden units correctly classified independent test cases conclusion variety related issues arise context briefly mention space constraints prior model entropy affect complexity search problem questions naturally arise grow function incremental learning scenario conclusion paper consideration admissible models major factor inductive inference choice description lengths models efficient optimization techniques seeking parameters model remain success nonetheless results provide theoretical insight practical extent provide check model selection acknowledgments research paper performed propulsion california institute technology contract national space administration addition work supported part force office scientific research grant number references barron statistical properties artificial neural networks proceedings ieee conference decision control barron cover minimum complexity density estimation ieee trans inform theory bridle training stochastic model recognition algorithms networks lead maximum mutual information estimation parameters advances neural information processing systems mateo morgan kaufmann cybenko complexity theory neural networks classification problems preprint maximum likelihood training neural networks proceedings international workshop statistics hand chapman hall london goodman miller smyth rulebased approach neural network classifiers proceedings international neural network conference paris france image pattern recognition translated brown york springer verlag rissanen universal coding information prediction estimation ieee trans inform theory
smyth admissible stochastic complexity models classification problems proceedings international workshop statistics hand chapman london freeman estimation inference compact coding royal star mangasarian method pattern applied breast diagnosis proceedings national academy sciences press
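The stochastic complexity paper above defines the total description length as the bits needed to describe the model plus the bits needed to describe the class labels given the model, and calls a model admissible only if its complexity alone does not already exceed a bound. A minimal sketch of that bookkeeping follows. Two points are assumptions on my part because the formulas were lost from the text: the data cost is taken as the negative base-2 log-likelihood of the true class labels, and the initial bound is taken as the number of samples times the class entropy. All function names are hypothetical.

```python
import math


def class_entropy_bits(class_counts):
    """Entropy (bits per sample) of the class variable, estimated from label counts."""
    total = sum(class_counts)
    return -sum((c / total) * math.log2(c / total) for c in class_counts if c > 0)


def data_cost_bits(true_class_probs):
    """Bits to encode the class labels given the model: the negative log2-likelihood
    of the probabilities the model assigned to each sample's true class."""
    return -sum(math.log2(p) for p in true_class_probs)


def description_length(model_bits, true_class_probs):
    """Total description length: model cost plus data cost given the model."""
    return model_bits + data_cost_bits(true_class_probs)


def admissible(candidates, bound_bits):
    """Keep the (name, model_bits) candidates whose complexity alone is within the bound.
    The bound can start at N * H(C) and shrink to the best total length found so far."""
    return [name for name, model_bits in candidates if model_bits <= bound_bits]


if __name__ == "__main__":
    # 100 samples with balanced binary labels give an initial bound of 100 bits.
    initial_bound = 100 * class_entropy_bits([50, 50])
    print(initial_bound)
    print(admissible([("small", 40), ("large", 250)], initial_bound))
```

Tightening the bound as better models are found is what lets a greedy search, such as the one that adds hidden units in the medical diagnosis example, stop considering larger architectures early.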
11 learning meets recursive squares algorithm abstract learning memorybased technique query extracts prediction interpolating locally neighboring exam ples query considered relevant distance measure paper propose datadriven method select basis optimal number neighbors considered prediction efficient identify validate local models recursive squares algorithm introduced text local approximation learning strategy model selection local combination promising models explored method proposed tested datasets compared stateoftheart approach introduction learning computation explicit request prediction received request fulfilled interpolating locally examples relevant distance measure prediction requires local modeling procedure composed structural parametric iden parametric identification consists optimization parameters local approximator hand structural identification involves things selection family local approximators selection metric evaluate examples relevant selection bandwidth size region data correctly modeled members chosen family approximators comprehensive tutorial local learning references atkeson problem bandwidth selection concerned approaches exist choice bandwidth performed based priori assumption data datadriven approaches interest hand constant bandwidth case global optimization minimizes error criterion dataset hand bandwidth selected locally tailored query point present work propose method belongs class local datadriven approaches assuming fixed metric local linear approximators method introduce selects bandwidth basis means local crossvalidation problem bandwidth selection reduced selection number neighboring examples nonzero weight local modeling procedure time prediction required specific query point local models identified including number neighbors generalization ability model assessed local crossvalidation procedure finally prediction obtained combining selecting models basis statistic crossvalidation errors main reason favor bandwidth selection 
adaptation local characteristics problem hand approach handle directly case database updated online hand globally optimized bandwidth approach principle require global optimization repeated time distribution examples major contribution paper consists recursive squares algorithm context learning appealing efficient solution intrinsically incremental problem identifying sequence local linear models centered query point including growing number neighbors worth leaveoneout crossvalidation model considered involve significant computational obtained press statistic myers simply partial results returned recursive squares algorithm schaal atkeson recursive squares algorithm incremental update local models present paper time algorithm perspective effective explore neighborhood query point contribution propose comparison local scale competitive cooperative approach model selection problem extracting final diction alternatives compared strategy strategy based combination wolpert section experimental analysis recursive algorithm local identification validation presented algorithm proposed conjunction strategies model selection combination compared experimentally rulebased tool developed ross quinlan generating piecewiselinear models local weighted regression variables mapping examples obtained random variable unknown moment distribution defined function mentioned properties implies assumption global made learning meets recursive squares algorithm problem local regression stated problem estimating regression function assumes specific query point information neighborhood query point hypothesis local parameter local linear approximation neighborhood obtained solving local polynomial regression metric space distance query point weight function bandwidth constant vector order constant term regression matrix notation solution stated weighted squares problem matrix vector element diagonal matrix diagonal element matrix assumed nonsingular inverse defined obtained local linear 
polynomial approximation prediction finally exploiting linearity local approximator leaveoneout cross validation estimation error variance obtained significant fact press statistic myers calculate error explicitly identifying parameters examples removed formulation press statistic case hand diagonal element matrix recursive local regression sake simplicity focus linear approximator extension generic polynomial approximators degree straightforward assume metric space attention centered problem bandwidth selection weight function indicator function adopted optimization parameter conveniently reduced opti mization number neighbors unitary weight assigned local regression evaluation words reduce problem bandwidth selection search space nearest neighbor query point main advantage deriving weight function defined simply updating model identified nearest neighbors straightforward inexpensive fact performing step standard recursive squares algorithm nearest neighbor query point matrix leaveoneout crossvalidation errors directly calculated model identification define vector leaveoneout errors initialization recursively evaluate values local approximation regression function prediction regression function query point vector leaveoneout errors extract estimate variance prediction error notice priori estimate parameter ance matrix reflects reliability initialization adopted large identity matrix local model selection combination recursive algorithm returns query point predictions leaveoneout error vectors information final prediction regression function obtained ways main paradigms considered based selection approximator criterion returns prediction combination local models selection paradigm frequently called adopted natural extract final prediction consists comparing prediction obtained basis classical square error criterion argmin learning meets recursive squares algorithm table summary characteristics datasets considered dataset housing prices number examples number 
weights conveniently discount error distance query point point error corresponds atkeson alternative paradigm explored effectiveness local combinations estimates wolpert adopting case square error criterion final prediction obtained weighted average models parameter algorithm suppose predictions error vectors ordered creating sequence integers prediction weights inverse square errors generalized ensemble method perrone cooper experiments results experimental evaluation incremental local identification validation algorithm performed datasets quinlan obtained repository machine learning databases murphy provided breiman summary characteristics dataset presented table methods compared adopt recursive identification validation algorithm strategies model selection combination considered approaches selected globally local bandwidth selection linear local models number neighbors basis prediction returned model square error criterion local bandwidth selection constant local models algorithm constant models derived directly recursive method model selected square error criterion local combination estimators method datasets proposed query linear local models constant models combined global bandwidth selection linear local models obtained prediction error crossvalidation dataset query points global bandwidth selection constant local models optimized globally constant queries table absolute error unseen cases method housing prices table relative error unseen cases method housing prices metric concerned adopted global euclidean metric based relative influence relevance friedman confident local metric improve performance learning method results methods introduced compared obtained experimental settings rulebased tool developed quinlan generating piecewiselinear models approach tested dataset crossvalidation strategy dataset divided randomly groups equal size turn groups testing remaining providing examples methods performed prediction unseen cases examples table present results 
obtained methods averaged crossvalidation groups methods compared examples conditions sensitive paired test significance significantly significance level consideration results concerns local combination estimators table method performs average winner linear constant dataset significantly dataset significantly average consideration comparison bandwidth selection global optimization number neighbors average performs counterparts datasets significantly dataset significantly comparison concerned recursive identification validation proposed obtains results comparable obtained stateoftheart method implemented datasets performs time significantly time significantly worse index performance investigated relative error defined square error unseen cases normalized variance test relative errors presented table show similar picture table square errors considered penalize larger absolute errors conclusion future work experimental results confirm recursive squares algorithm effectively local context trivial metric adopted local combination estimators identified recursively showed compete stateoftheart approach future work focus problem local metric selection sophisticated ways combine local estimators extend work polynomial approximators higher degree acknowledgments work supported program work supported european union grant authors ross quinlan gratefully acknowledge software details corn breiman dataset repository datasets paper references artificial intelligence review special issue learning atkeson moore schaal locally weighted learning artificial intelligence review factorization methods discrete sequential estimation york academic press learning local modeling control design international journal control accepted publication friedman flexible metric nearest neighbor classification tech department statistics stanford university murphy machine learning databases myers classical modern regression applications boston perrone cooper
networks disagree ensemble methods hybrid neural networks pages artificial neural networks speech vision chapman hall quinlan combining instancebased modelbased learning pages machine learning proceedings tenth international conference morgan kaufmann schaal atkeson constructive incremental learning local information neural computation wolpert stacked generalization neural networks
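The recursive identification and validation procedure described in this paper can be sketched in a few lines. The sketch below is a minimal illustration, not the authors' implementation: the function name `lazy_rls_predict`, the parameter names, and the default initialization constant `lam` are assumptions. It adds the nearest neighbors of the query one at a time with a standard recursive least squares update, computes the leaveoneout (PRESS) errors from the updated matrix without refitting, and applies the winner selection paradigm by keeping the number of neighbors with the smallest mean squared leaveoneout error.

```python
import numpy as np

def lazy_rls_predict(X, y, x_query, k_max, lam=1e6):
    """Local linear prediction at x_query (hypothetical helper).

    Returns (prediction, selected_k, mean_squared_loo_error).
    k_max must exceed the number of model parameters (dimension + 1).
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    x_query = np.asarray(x_query, dtype=float)
    # Indicator weight function: only the k nearest neighbors count.
    order = np.argsort(np.linalg.norm(X - x_query, axis=1))[:k_max]
    Z = np.hstack([np.ones((len(order), 1)), X[order]])  # constant term first
    t = y[order]
    zq = np.concatenate([[1.0], x_query])
    p = Z.shape[1]
    P = lam * np.eye(p)      # large multiple of the identity (assumed value)
    beta = np.zeros(p)
    best = (np.inf, None, None)
    for k in range(1, len(order) + 1):
        z = Z[k - 1]
        Pz = P @ z
        # One recursive least squares step for the k-th nearest neighbor.
        P = P - np.outer(Pz, Pz) / (1.0 + z @ Pz)
        beta = beta + (P @ z) * (t[k - 1] - z @ beta)
        if k <= p:
            continue  # too few points to validate a linear model
        # PRESS statistic: leave-one-out errors without re-identification.
        resid = t[:k] - Z[:k] @ beta
        leverage = np.einsum("ij,jk,ik->i", Z[:k], P, Z[:k])
        loo = resid / (1.0 - leverage)
        mse = float(np.mean(loo ** 2))
        if mse < best[0]:
            best = (mse, k, float(zq @ beta))
    return best[2], best[1], best[0]
```

The combination paradigm would replace the final selection with a weighted average of the per-k predictions, for example with weights inversely proportional to the mean squared leaveoneout errors.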
5 predicting complex behavior sparse asymmetric networks william levy department health sciences center university abstract recurrent networks threshold elements studied associative memories devices research concentrated fullyconnected symmetric works relax stable fixed points asymmetric networks show richer dynamical behavior sequence generators flexible devices paper approach problem predicting complex global behavior class asymmetric networks terms network parameters works show fixedpoint effectively aperiodic behavior depending parameter values approach parameters obtain desired complexity dynamics approach qualitative insight system behaves suggests applications introduction recurrent neural networks threshold elements investigated recent years part interesting dynamics interest focused symmetric connections relax stable fixed points hopfield associative memories devices networks asymmetric connections potential predicting complex behavior sparse asymmetric networks richer dynamic behavior learning sequences amari sompolinsky kanter paper introduce approach predicting complex global behavior interesting class random sparse asymmetric networks terms network parameters approach parameter values obtain desired activity level qualitatively dynamic behavior network parameters equations network consists identical neurons threshold fixed pattern excitatory connectivity neurons generated prior simulation bernoulli process probability connection neuron neuron excitatory connections fixed global inhibition linear number active neurons number active neurons time weight excitation firing status neuron variable indicating presence absence connection equations equation simple variant shunting inhibition neuron model studied researchers network similar posed mart mart note combined write neuron equations familiar inhibition format defining network behavior paper study evolution total activity system equation firing condition neuron time activity time order fire time neuron 
active inputs calculate average firing probability neuron prior activity active inputs large gaussian approximation binomial distribution levy hyperbolic tangent approximation error function finally large assume simpler form assuming neurons fire independently tend large sparse networks levy networks activity time distributed leads stochastic return activity figure plot neuron network values vertical bars show standard deviations side clear networks activity falls range predicted initial transient period system switches correspond activity fixed point trapped region point defined call attracting region size location attracting region determined largely qualitative dynamic behavior network ranges networks show kinds behavior fixed points short cycles effectively aperiodic dynamics describing behaviors introduce notion neurons number input connections neuron neuron possibly meet firing criterion time neuron activity group neurons considered neurons specific activity unique neurons network rons active time step average size activity predicting complex behavior sparse asymmetric networks aperiodic behavior high activity figure predicted distribution empirical data networks vertical bars represent standard deviations predicted tribution note empirical values fall predicted range behavior activity time step figure activity timeseries kinds behavior shown neuron work graphs correspond data shown figure levy shown neurons achieve average activity describe kinds dynamic behavior exhibited networks fixed point behavior small close inhibition strong control activity neurons switch large close stochastic dynamics eventually finds remains activity fixed point effectively aperiodic behavior deterministic finite state systems networks show aperiodic chaotic behavior time tion long make dynamics effectively aperiodic occurs attracting region moderate activity level defined number neurons situation network start initial condition successively visits large number states activity 
yields aperiodic timeseries shown figure behavior attracting region high activity level neurons fire time step order maintain activity predicted forces network states similar turn leads similar successor states network settles short limit cycle high activity figure attracting region activity level network activity limit cycle mediated small group high fanin neurons figure effect unstable regard initial conditions expected significant increasing network size variance variance figure neuron firing probability histograms networks tively aperiodic phase graph network random connectivity generated bernoulli process graph network fixed fanin corresponds fanin predicting complex behavior sparse asymmetric networks interesting issue arises context effectively aperiodic behavior statespace sampling constraint activity assess histogram individual neuron firing rates figure shows histogram neuron network effectively aperiodic phase subspaces sampled histogram broad differences fanin individual neurons larger networks shows neuron firing histogram neuron network neuron fanin sampling ergodic dynamics biased subspaces figure complete nonzero activation values identical rons fanin network levy activation dynamics modeling focused neural firing underlying neuron activation values values neuron fanin represents number active inputs represents activation values neuron networks ndimensional activation state evolves activa tion space extremely complex regular object figure plot subspace projection called plot activation space network excluding states neurons shown fanin small subset activation space sampled constraining effects dynamics values relating activity level practical standpoint average activity work related parameter hyperbolic tangent approxi mation equation define activity level time proportion active neurons variable sense amari long term activity level confined region activity fixed point reasonable estimate activity level relate solve fixed point equation substituting 
definition figure predicted empirical activities neuron networks data point averaged networks range approximation breaks high small values range applicability wider increases figure shows performance predicting average activity level network note leads equation conclusion studied general class asymmetric networks developed statistical model relate dynamical behavior parameters behavior largely characterized composite parameter varied understanding behavior insight complex possibilities offered sparse asymmetric networks regard modeling brain regions hippocampal area mammals complex behavior random asymmetric networks discussed parisi parisi show control complexity networks setting parameters appropriately acknowledgements research supported department university john references amari learning patterns pattern sequences selforganizing nets threshold elements ieee trans computers amari method statistical neurodynamics hopfield neural networks physical systems emergent collective computational abilities proc acad mart simple memory theory phil trans lond levy dynamics sparse random networks review levy setting activity level sparse random networks review length attractors asymmetric random neural networks deterministic dynamics phys math parisi asymmetric neural networks process learning phys math sompolinsky kanter temporal association asymmetric neural networks phys lett
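The network studied in this paper is simple enough to simulate directly. The sketch below is a hedged illustration under stated assumptions: the function names are hypothetical, self-connections are excluded by assumption, and the firing rule is written in the reduced form obtained by rearranging a shunting-inhibition threshold test, so that a neuron fires when its number of active inputs reaches `alpha` times the previous total activity. Here `alpha` stands for the composite parameter; its exact definition in terms of threshold, excitatory weight, and inhibition constant is assumed, not quoted.

```python
import numpy as np

def make_network(n, p, rng):
    """Fixed excitatory connectivity drawn once from a Bernoulli(p) process.

    C[i, j] == 1 means neuron j projects to neuron i.
    """
    C = (rng.random((n, n)) < p).astype(np.int64)
    np.fill_diagonal(C, 0)  # assumption: no self-connections
    return C

def step(C, z, alpha):
    """One synchronous update of the binary firing vector z."""
    m = int(z.sum())        # total activity at the previous time step
    s = C @ z               # number of active inputs per neuron
    # A neuron with no active input cannot fire (the ratio is undefined).
    return ((s > 0) & (s >= alpha * m)).astype(np.int64)

def simulate(C, z0, alpha, steps):
    """Return the activity (number of active neurons) at each time step."""
    z = z0.copy()
    activity = [int(z.sum())]
    for _ in range(steps):
        z = step(C, z, alpha)
        activity.append(int(z.sum()))
    return np.array(activity)
```

Plotting the returned series for different values of `alpha` is one way to look for the fixed point, short cycle, and effectively aperiodic regimes described in the text.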
2 discovering high order features field modules discovering high order features modules field galland geoffrey hinton physics dept computer science dept university toronto toronto canada abstract form deterministic boltzmann machine learn procedure presented efficiently train network discriminate input vectors technique directly utilizes free energy field modules represent probability criterion free energy readily manipulated learning procedure conventional deterministic boltzmann learn fails extract higher order feature shift network bottleneck combining field modules information objective function rapidly produces modules perfectly extract important higher order feature direct external supervision introduction boltzmann machine learning procedure hinton sejnowski made efficient field approximation stochastic binary units replaced deterministic realvalued units peterson anderson deterministic boltzmann learning tasks subsets units treated input output varied trial trial peterson respect resembles learning procedures involve settling stable state pineda paradigm force network explicitly extract important higher order features ensemble training vectors forcing network pass information required correct completions narrow bottleneck backpropagation networks hidden layers learning discover important galland hinton underlying features hinton original demonstrate idea effectively hidden layers initial simulations conventional techniques successful combined type learning objective function resulting network extracted crucial higher order features rapidly perfectly task figure shows network input vector divided parts random binary vector generated shifting left pixel random binary vector generated shift generate means uniquely fourth filter ambiguous cases true perform correct completion network explicitly represent shift single unit connects halves shift order property extracted hidden units figure simulations standard deterministic boltzmann learning discussion assumes 
familiarity deterministic boltzmann learn procedure details obtained hinton positive phase learning sets shift matched vectors clamped inputs negative phase allowed settle unclamped weights changed training case online version learning procedure choice input clamp changed systematically learning process left unclamped equally technique successful problems hidden layer train network correctly perform task input layers settle correct state clamped result single discovering high order features field modules central unit failed extract shift general learning procedure stochastic difficulty learning tasks layer nets failure development procedure form correctly extract shift hidden layers direct external supervision learning procedure field modules unit states range free energy settles free energy minimum nonzero temperature states units minimum derivative respect weight assuming hinton owij suppose network module discriminate input vectors criterion input vectors dont output unit degree view negative field free energy module measure clamped input vector standpoint define probability input vector fits criterion equilibrium free energy module vector clamped inputs supervised training performed crossentropy error function hinton input cases criterion cases dont crossentropy expression error galland hinton derivatives error derivatives weight obtained equation module trained gradient descent high free energy negative training cases free energy positive cases positive case owij owij negative case owij owij test procedure trained shift detecting module composed input units hidden units figure free energy shifts weight changed online fashion awij shifted case awij left shifted case sweeps training cases required successfully train module detect shift training easy hidden units receive connections input units clamped network settles free energy minimum iteration details simulations galland hinton maximizing mutual information field modules learning procedure inherently 
supervised discover shift important underlying feature method discovering high order features field modules modules obvious implementing idea quickly creates modules agree maximize mutual information stochastic binary variables represented free energies modules strong pressure binary variable high entropy cases mutual information binary variables entropy joint distribution training cases entropies individual distributions field modules stochastic binary variables case free energy module training case clamped input compute probability module averaging input sample distribution prior probability input case similarly compute values joint probability distribution equation partial derivatives individual joint probability functions respect weight module readily calculated hinton entropy stochastic binary variable entropy joint distribution partial derivative respect single weight module computed depend differentiate shown galland hinton derivative derivation drawn becker hinton show mutual information learning signal backpropagation nets perform gradient ascent weight modules procedure probabilities cases accumulated pass approach applied system field modules left halves figure connecting central unit detect shift task random binary vectors clamped inputs related shift modules provide mutual information representing shift maximizing mutual information created perfect shift detecting modules sweeps training cases training module found free energy left shifts high free energy details simulations galland hinton summary standard deterministic boltzmann learning failed extract high order features network bottleneck explored variant learning free energy module represents stochastic binary variable variant efficiently discover shift important feature external supervision provided architecture objective function designed extract higher order features invariant space acknowledgement becker helpful comments research supported grants
ontario information technology research center national science engineering research council canada geoffrey hinton fellow canadian institute advanced research references becker hinton spatial coherence internal teacher neural network technical report university toronto galland hinton experiments discovering high order features field modules university toronto connectionist research group technical report forthcoming hinton learning distributed representations concepts proceedings eighth annual conference cognitive science society amherst mass hinton connectionist learning procedures technical report carnegie mellon university hinton deterministic boltzmann learning performs steepest descent weightspace neural computation hinton sejnowski learning boltzmann machines rumelhart mcclelland group parallel distributed processing microstructure cognition volume foundations press cambridge hopfield neurons graded response collective computational properties twostate neurons proceedings national academy sciences peterson anderson field theory learning algorithm neural networks systems peterson explorations field theory learning algorithm technical report computer technology corporation austin pineda generalization backpropagation recurrent neural networks phys lett
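The mutual information objective used in this paper can be illustrated numerically. The sketch below is a hedged illustration, not the authors' code: the function names are hypothetical, and the logistic mapping from free energy to probability (low free energy means the input fits the criterion) is an assumption chosen to match the convention that positive cases are trained toward low free energy. Each module is treated as a stochastic binary variable, the two modules are assumed independent given the input case, and the mutual information is the sum of the individual entropies minus the entropy of the joint distribution, measured in nats.

```python
import numpy as np

def fit_probability(free_energy, temperature=1.0):
    """Assumed mapping: probability that the clamped input fits the criterion."""
    f = np.asarray(free_energy, dtype=float)
    return 1.0 / (1.0 + np.exp(f / temperature))

def entropy(probs):
    """Entropy in nats of a discrete distribution, with 0 log 0 taken as 0."""
    p = np.asarray(probs, dtype=float)
    p = p[p > 0]
    return float(-np.sum(p * np.log(p)))

def mutual_information(F_a, F_b, prior=None, temperature=1.0):
    """Mutual information between two modules over a set of training cases.

    F_a, F_b: equilibrium free energies of each module, one entry per case.
    prior: probability of each input case (uniform if omitted).
    """
    pa = fit_probability(F_a, temperature)
    pb = fit_probability(F_b, temperature)
    n = len(pa)
    w = np.full(n, 1.0 / n) if prior is None else np.asarray(prior, dtype=float)
    # Joint distribution, averaging over cases (modules assumed
    # conditionally independent given the clamped input).
    joint = np.array([
        np.sum(w * pa * pb),
        np.sum(w * pa * (1 - pb)),
        np.sum(w * (1 - pa) * pb),
        np.sum(w * (1 - pa) * (1 - pb)),
    ])
    p_a = np.sum(w * pa)
    p_b = np.sum(w * pb)
    return entropy([p_a, 1 - p_a]) + entropy([p_b, 1 - p_b]) - entropy(joint)
```

Two modules that always agree and are each active on half of the cases reach the maximum of log 2 nats, which is the pressure toward agreeing, high-entropy binary variables described in the text.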
3 discovering discrete distributed representations iterative competitive learning michael mozer department computer science institute cognitive science university colorado boulder abstract competitive learning unsupervised algorithm classifies input patterns mutually exclusive clusters neural framework cluster represented processing unit winnertakeall pool input pattern present simple extension algorithm construct discrete distributed representations discrete representations easy analyze information content readily measured distributed representations explicitly encode similarity basic idea apply competitive learning iteratively input pattern stage subtract input pattern component captured representation stage component simply weight vector winning unit competitive pool subtraction procedure forces competitive pools stages encode aspects input algorithm essentially traditional data compression technique multistep vector quantization neural suggests potentially powerful extensions approach introduction competitive learning grossberg kohonen rumelhart zipser malsburg unsupervised algorithm classifies input patterns mutually exclusive clusters neural framework cluster represented processing unit winnertakeall pool input pattern competitive learning constructs local representation single unit response input present simple extension algorithm construct discrete distributed representations discrete representations easy analyze information content readily measured distributed representations explicitly encode similarity begin describing standard competitive learning algorithm competitive learning layer network input units competitive units competitive unit represents classification input competitive units input units connected winnertakeall pool single competitive unit active formally activity competitive unit input activity vector connection strengths input units competitive unit denotes vector norm conventional weight update rule step size algorithm moves weight vector center cluster input
patterns algorithm attempts develop representation input discrete alternatives representation simply weight vector winning competitive unit develop representation follow durbin competitive learning viewed performing gradient descent error measure index patterns parameter soft competitive learn model bridle rumelhart press specifies degree competition winnertakeall version competitive learning obtained limit extending competitive learning competitive learning constructs local representation input tive learning extended construct distributed representations idea independent competitive pools form partition input space fails pools discover partitioning partitioning force pools encode components input competitive learning network component input encoded simply competitive learning algorithm guaranteed extract information captured pool competitive units infor mation subtracted procedure invoked iteratively capture ferent aspects input arbitrary number competitive pools iterative competitive learning idea heart algorithms performing principal components analysis algorithms discover continuousvalued feature dimensions concerned discovering discrete distributed representations discovery features continuous features quantized form discrete features idea sanger explore cost elaborate formalize model network composed arbitrary number stages figure stage consists input units competitive units input competitive units stage feed activity input units higher stage activity input units stage external input subsequent stages additional index stage number figure iterative competitive learning model reconstruct original input pattern activities competitive units components captured winning unit stage simply summed variant independently proposed granger lynch algorithm inspired neurobiological model competitive unit activation rule product tance measure todd leen steve work attention mozer problem rule difficult interpret network computing aspect input captured winning unit 
reconstructed resulting activity pattern information activation rule combination learning rule clear putational justification virtue underlying objective measure equation optimized turns virtually identical conventional technique data compression multistep vector quantization gray simple input patterns forming rectangle space located network stages units stage discovers primary dimension variation xaxis units develop weight vectors removing ponent input points points left side rectangle points side stage network discovers secondary dimension variation yaxis response network input pattern summarized competitive units stage activated units stage numbered response patterns generated discovered code represent inputs result inputs input environment consists clusters points centered corners rectangle case code describe input uniquely distinguish clusters image compression discovers compact discrete codes algorithm data image compression problems data transformed compact representation reconstruct original data performs transformation resulting code consisting competitive unit response pattern reconstruction achieved equation experimented pixel image bits gray level information pixel trained random patches image total train trials network input units stages competitive units initial weights random selected normal distribution standard deviation fixed figure shows incoming connec tion strengths competitive units stages connection strengths depicted grid cells shading weight position image patch competitive unit discovering discrete distributed representations stage stage stage stage stage stage stage stage stage figure unit connection strengths stages training image compressed dividing image nonoverlapping patches presenting turn obtaining compressed code reconstructing patch code stage network units stage compressed code bits number bits pixel compressed code obtain levels compression number stages varied fortunately require retraining features detected stage depend 
number stages earlier stages capture significant variation input network trained stages compress image achieving pixel coding image train originally neural image compression study cottrell munro zipset compression scheme threelayer back propagation autoencoder image patch back hidden layer hidden layer fewer units input layer served encoding hidden unit activities continuous valued standard measure performance signaltonoise ratio rithm average energy relative average reconstruction error forms cottrell network table result surprising data compression literature vector quantization proaches similar work approaches cottrell sanger reason proaches quantization account development code approaches training procedure discovers code quantization step turns code form digital mozer data transmission storage distinct processes cottrell network hidden unit encoding learned demands quantization quantized code retain information signal takes quantization account training table signaltonoise ratio compression levels compression cottrell comparison vector quantization approaches mentioned previously essentially neural reformulation convention data compression scheme called multistep vector quantization adopting neural perspective suggests promising variants approach vari result viewing encoding task optimization problem finding weights minimize equation mention variants methods finding solution efficiently consistently final powerful extension algorithm studied vector quantization literature avoiding local optima rumelhart zipser noted competitive learning experiences problem locally optimal solutions competitive unit captures input patterns capture eliminate situations introduced secondary error term purpose force competitive units equally activity competitive unit trials based soft tive learning model yields weight update rule step size constraint part ultimate solu tion gradually reduced image compression simulation initially decreased training trials principled solution local 
optimum problem leaky learning idea suggest rumelhart zipser alternative schemes proposed selecting initial code weights vector quantization ture discovering discrete distributed representations constraints weights explored idea increase likelihood converging good solution achieve rapid convergence idea based facts solution weight vector competitive unit inputs captured unit rise observation stage input competitive pools units facts lead strong constraint weights input vector stage pattern part part clusters input patterns partitioned competitive units stage number elements cluster consequence optimal solution property observed figure constraining weights manner forming gradient descent ratio weight parameters quality solution convergence rate dramatically improved generalizing transformation stages stage winning competitive unit specifies transformation obtain transformation simply translation reason generalized include rotation dilation transformation matrix includes translation notation formally correct augmented element stant translations rotation dilation parameters learned gradient descent search error measure equation recon struction involves inverting sequence transformations simple situation generalized transformation depicted figure subtracting component detected stage clusters rotated alignment allowing stage capture remain mozer variation input extension proves test connectivity patterns figure suggest variations orientation permit compact representation input data figure sample input space data points acknowledgements research supported grant grant james mcdonnell foundation paul smolensky helpful comments work cottrell providing image data software references granger lynch simulation performs hierarchical cluster science bridle training stochastic model recognition algorithms networks lead maximum mutual information estimation parameters touretzky advances neural information cessing systems mateo morgan kaufmann cottrell munro zipser image compression 
back propagation programming models cognition review cognitive science durbin april principled competitive learning unsupervised supervised networks post presented conference neural networks computing snowbird gray vector quantization grossberg adaptive pattern classification universal parallel development neural feature detectors biological cybernetics unsupervised learning backward inhibition proceedings eleventh international joint conference artificial intelligence morgan kaufmann kohonen clustering topological maps patterns lang proceedings sixth international conference pattern recognition spring ieee computer society press rumelhart press connectionist processing learning statistical inference chauvin rumelhart backpropagation theory architectures applications hillsdale baum rumelhart zipser feature discovery competitive learning cognitive science sanger optimal unsupervised learning singlelayer linear feedforward neural network networks malsburg selforganization orientation sensitive cells striate cortex
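The iterative competitive learning scheme of this paper can be sketched as follows. This is a minimal illustration under stated assumptions, not the author's implementation: the function names, the learning rate, the small random initialization, and the choice to subtract the winner's weight vector after its update are all assumptions. Each stage picks the competitive unit whose weight vector is closest to its input, moves that weight vector toward the input, and passes the residual to the next stage. The discrete code is the list of winners, and reconstruction sums the winning weight vectors.

```python
import numpy as np

def train(X, n_stages, n_units, lr=0.05, epochs=20, seed=0):
    """Online winner-take-all training of all stages (hypothetical helper)."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    W = rng.normal(scale=0.1, size=(n_stages, n_units, X.shape[1]))
    for _ in range(epochs):
        for x in X[rng.permutation(len(X))]:
            r = x.copy()
            for s in range(n_stages):
                i = int(np.argmin(np.linalg.norm(W[s] - r, axis=1)))
                W[s, i] += lr * (r - W[s, i])  # conventional update rule
                r = r - W[s, i]  # remove the component captured at this stage
    return W

def encode(W, x):
    """Return the index of the winning unit at each stage."""
    r = np.asarray(x, dtype=float)
    code = []
    for Ws in W:
        i = int(np.argmin(np.linalg.norm(Ws - r, axis=1)))
        code.append(i)
        r = r - Ws[i]
    return code

def decode(W, code):
    """Reconstruct the input by summing the winning weight vectors."""
    return sum(Ws[i] for Ws, i in zip(W, code))
```

With two stages of two units each, a well-trained network on the rectangle example would use the first stage for the horizontal position and the second for the vertical position, giving a two-bit code for the four corner clusters. Plain winner-take-all training can still get stuck in the local optima discussed in the text, so the sketch makes no claim about convergence.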
2 baird associative memory simple model oscillating cortex bill baird dept molecular cell biology berkeley abstract generic model oscillating cortex assumes minimal coupling justified anatomy shown function memory previously developed theory network explicit excitatory neurons local inhibitory interneuron feedback forms nonlinear oscillators coupled long range connections local learning rule primary higher order synapses ends long range connections system learns store kinds lation amplitude patterns observed olfactory visual cortex rule derived general projection algorithm recurrent analog networks analytically guarantees content addressable memory storage continuous periodic sequences capacity fourier components node network spurious attractors introduction sketch recent results work discussed completely patterns oscillation observed large scale activity olfactory cortex visual neocortex shown predict olfactory visual pattern recognition responses trained animal appears cortical computation general occur dynamical interaction resonant modes thought case olfactory system sensitivity neurons location arrival times dendritic input associative memory simple model oscillating cortex pulses generated collective oscillation ideal formation reliable range transmission collective activity cortical area oscillation serve function relevant microscopic activity cortical regions defined phase coherent macroscopic collective states uncorrelated microscopic activity view correct oscillatory network modules form actual cortical substrate diverse sensory motor cognitive operations studied static networks ultimately shown functions accomplished dynamic networks interested modeling category learning object recog nition feature preprocessing equivalence classes ratios feature outputs feature space established prototype objects categories invariant sensory instances categories world kind function generally hypothesized cortex olfactory cortex visual system oscillatory network function 
feature binding clustering role hypothesized phase labels primary visual cortex decision states hypothesized olfactory bulb hopfield preprocessing systems modification connections learning perceptual objects category learning full adaptive cross coupling required input feature vectors potential attractors kind anatomical structure characterizes infer cortex columns structured fiber system prominent primary cortex shares high level association cortex structure cats rats preprocessing structures primary cortex grown evolved give expanded capabilities pattern recognition power contributed feature preprocessing developed object clas siftcation system learning underlie daily conceptual evolution phenomenon ultimate interest work minimal model oscillating cortex analog state variables recurrence oscillation bifurcation hypothesized essential features cortical networks explore approach explicit modeling excitatory inhibitory neurons long range connections basic requirement biologically feasible network architecture analyse minimal model intended assume coupling justified anatomy simulations analytic results proved argue oscillatory associative memory function realized system network meant real biology designed reveal general mathematical principles mechanisms actual system function principles observed applied contexts baird long range excitatory excitatory connections connections olfactory connections neocortex units neural populations density full cross coupling exists weights average synaptic strengths connections problem population level coupling symmetry average connection emerging operation outer product learning rule initially random connections network units neuron pools analog state variables arise naturally continuous local pulse densities cell voltage averages smooth sigmoidal population inputoutput functions slope increases arousal animal measured olfactory local inhibitory interneurons ubiquitous feature anatomy cortex make long range connections connections 
interconnections left minimal model resulting network fair studied circuitry olfactory cortex thought cases real biological network associative memory function neocortex complicated roughly viewed olfactory stacked expect analysis system lend insight mechanisms associative memory show model capable storing complicated spatiotemporal trajectories argue serve model memory sequences actions motor cortex dimensional system minimal coupling structure math matrix matrix excitatory interconnections identity matrices multiplied positive give strength coupling local inhibitory feedback loops state vector composed local average cell voltages excitatory neuron populations inhibitory neuron populations standard network equations coupling component form sigmoidal function symmetric inhibitory units receive direct input give direct output hidden units create oscillation amplitude patterns stored excitatory viewed simple eralization analog hopfield network architecture store periodic static attractors associative memory simple model oscillating cortex expand network order series origin network sigmoid symmetric symmetry order terms expansion vanish leaving cubic terms nonlinearity actual expansion sigmoids coordinate system give cubic terms form competitive negative cubic terms general directly programmable nonlinearity independent linear terms serve create multiple periodic attractors causing oscillatory modes linear term compete sigmoidal linearity static modes hopfield network intuitively terms thought maxima saturation landscape stored linear modes positive eigenvalues expand positioning directions eigenvectors modes make stable precise definition landscape strict function special polar coordinate success storing multiple oscillatory attractors sigmoid learning rule driven effective higher order biological model physiological point view considered model biological network operating linear region axonal sigmoid sigmapi units higher order synaptic nonlinearities biological 
justification higher order synapses long range excitatory connections higher order synaptic weights realized locally tion tiny fibers dense exact circuitry impossible investigate present experimental techniques single axons multiple branches contribute separate synapses dendrites target cells neighboring synapses dendrite interact nonlinear fashion modeled higher order synaptic terms researchers suggested dense crossing combination axons vicinity dendritic branch neuron neuron pool factors stimulated axons dendrite axons form cluster nearby synapses dendrite realize product synapse required higher order terms created process competitive cubic cross terms viewed physiologically complicated nonlinear processing decision making nonlinearity system baird opposed usual sigmoidal nonlinearity weights cubic synaptic terms network nonlinearity programmed detail analysis real eigenvectors give magnitudes complex eigenvectors theorem real eigenvalue matrix corre sponding eigenvector pair complex conjugate eigenvalues complex conjugate pair eigenvectors proof theorem amplitude phase patterns convert magnitude phase representation dividing common factor magnitudes eigenvectors display amplitude patterns interest restricted coupling oscillations network standing waves phase constant kind neuron differs basically observed olfactory bulb primary olfactory cortex cortex phase inhibitory components bulb lags phase excitatory components degrees easy choose model phase lags degrees learning projection algorithm theory detailed program linearly independent eigenvalues eigenvectors projection operation desired eigenvectors columns diagonal matrix desired eigenvalues complex eigenvectors follow associative memory simple model oscillating cortex learned form projection matrix eigenvectors columns forming matrix complex eigenvalues blocks diagonal project directly general cubic terms specific projection operation added network equations linear terms complex modes eigenvectors 
linearization analytically guaranteed projection characterize periodic attractors network vector field chosen normal form projected higher order synaptic weights general cubic terms operations constitute normal form projection algorithm member pair complex eigenvectors shown suffice eigenvector entered matrix projection operation real imaginary component columns expression periodic attractor established pattern matrix projection algorithm general cubic terms require long range inhibitory connections simulations oscillator networks reveal higher order terms anatomically justified long range excitatory connections cubic effective storing randomly chosen sets desired patterns behavior network close theoretical ideal guaranteed network general higher order terms stored oscillatory patterns reduced coupling general analytic justification normal form guarantees choices weights found projection operation general find work shows perturbation theory calculation normal form coefficients general high dimensional cubic nets tractable principle permits removal higher order weights produced projection algorithm incorporated improved learning rule requires fewer excitatory higher order weights exploring size neighborhood state space origin rule effective lead rigorous proof performance networks learning local hebb rules show orthonormal static patterns projection operation matrix reduces outer product hebb rule baird projection higher order weights multiple outer product rule rule guaranteed establish desired patterns eigenvectors matrix eigenvalues rule higher order weights cubic terms ensure patterns defined eigenvectors attractors network outer product local synapse rule synapse additive incremental learning system selforganizing modify based activity rank coupling matrix grows memories learned hebb rule capacity appears degenerate subspace eigenvalues flow directed regions state space patterns stored minimal real eigenvectors learned converted network structure standing wave 
oscillations constant phase absolute eigenvectors amplitudes mathematical perspective eigenvectors permutations signs components lead positive amplitude vector means amplitude patterns stored hebb rule excitatory connections ways find perfectly orthonormal eigenvectors stores amplitude vectors complexity dendritic processing discussed previously impossible distribution signs final effect synapses excitatory neurons biological system make mathematical degree freedom input objects feature preprocessing primary secondary sensory cortex expected outputs object recognition systems modeled rules patterns eigenvectors longer directly hebb rule expect kind performance found hopfield networks memories obtain reduced capacity automatic clustering similar exemplars investigation unsupervised induction categories training examples subject future architectural variations olfactory bulb model biologically interesting architecture store kinds patterns excitatory inhibitory plausible model olfactory bulb primary olfactory cortex experimental work freeman suggests associative memory function cortex evidence long range excitatory excitatory coupling olfactory bulb weaker cortex long range excitatory connecting halves bulb anatomical data show axons entering inhibitory cell associative memory simple model oscillating cortex layers eigenvectors polar form inhibitory population model additional term appears subtracted real part complex eigenvalues added extensions line analysis lateral inhibitory inhibitory excitatory feedback connections block coupling matrix matrix similarly full excitatory excitatory full excitatory inhibitory coupling blocks considered conjecture phase restrictions minimal model relaxed degrees freedom traveling waves exist acknowledgements supported acknowledge support freeman invaluable assistance morris hirsch references baird bifurcation theory approach vector field programming periodic attractors proc joint conf neural networks page june baird bifurcation 
learning network models oscillating cortex forest editor proc conf emergent computation baird bifurcation theory approach analysis synthesis neural networks engineering biological modelling research notes neural computing springer freeman mass action nervous system academic press york grey singer stimulus dependent neuronal oscillations visual cortex area neuroscience lewis haberly james bower olfactory cortex model circuit study associative memory neuroscience
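The linear part of the projection operation described above can be illustrated with a small sketch. It assumes orthonormal stored patterns, so that the transpose of the pattern matrix can stand in for its inverse and the projection becomes a sum of outer products, which is the case the text relates to the Hebb rule. A complex mode is entered as two columns (real and imaginary parts) with a 2x2 block on the diagonal, and a real mode as one column with a scalar eigenvalue. The helper names (`matmul`, `matvec`, `projection_weights`), the pattern vectors, and the eigenvalue settings are illustrative assumptions and are not taken from the paper; the cubic (higher order) terms are not modeled here.

```python
# Sketch of the linear projection W = P J P^T for orthonormal pattern columns.
# All names and numbers are illustrative assumptions, not the paper's values.

def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def transpose(a):
    return [list(col) for col in zip(*a)]

def matvec(m, v):
    return [sum(mij * vj for mij, vj in zip(row, v)) for row in m]

def projection_weights(patterns, block_diag):
    """patterns: list of k orthonormal vectors; block_diag: k x k matrix J."""
    p = transpose(patterns)                          # n x k, one pattern per column
    return matmul(matmul(p, block_diag), patterns)   # P J P^T

# Three orthonormal patterns on four units.
p1 = [0.5, 0.5, 0.5, 0.5]      # real part of the oscillatory mode
p2 = [0.5, -0.5, 0.5, -0.5]    # imaginary part of the oscillatory mode
p3 = [0.5, 0.5, -0.5, -0.5]    # a static (real) mode

alpha, omega, lam = 0.1, 2.0, -1.0   # growth rate, frequency, real eigenvalue
J = [[alpha, -omega, 0.0],
     [omega,  alpha, 0.0],
     [0.0,    0.0,   lam]]

W = projection_weights([p1, p2, p3], J)

# The complex vector p1 - i*p2 is an eigenvector with eigenvalue alpha + i*omega.
v = [a - 1j * b for a, b in zip(p1, p2)]
Wv = matvec(W, v)
Wp3 = matvec(W, p3)
```

With a diagonal `J` the same function returns the weighted sum of outer products of the patterns, which is the Hebbian special case mentioned in the text.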
9 bayesian model comparison monte carlo chaining david barber christopher bishop neural computing research group aston university birmingham abstract techniques bayesian inference applied great success problems neural computing including evaluation regression functions determination error bars predictions treatment hyperparameters problem model comparison challenging current techniques significant limitations paper show extended form markov chain monte carlo called chaining provide effective estimates relative probabilities models present results robot problem compare results obtained standard gaussian approximation framework bayesian model comparison bayesian treatment statistical inference state knowledge values parameters model terms probability distribution function initially chosen prior distribution combined likelihood function bayes theorem give posterior distribution form data predictions model obtained performing integrations weighted posterior distribution barber bishop comparison models based relative probabilities expressed bayes theorem terms prior probabilities give requires evaluate model evidence corresponds denominator relative probabilities models select single probable model form committee models probabilities convenient write numerator form error function normalization posterior distribution requires generally straightforward evaluate extremely difficult evaluate model evidence posterior distribution typically small narrow regions highdimensional parameter space unknown apriori standard numerical integration techniques approach based local gaussian approximation mode posterior mackay approximation expected accurate number data points large relation number parameters model fact complex models problems data bayesian methods offer neal argued bayesian perspective reason limit number parameters model computational reasons approach evaluation model evidence overcomes limitations gaussian framework additional techniques references bayesian model 
comparison chaining suppose simple model evaluate evidence easily generate sample distribution evidence model expressed form monte carlo approximation poor error functions significantly exponent dominated regions small samples small regions simple monte carlo approach yield poor results problem equivalent evaluation free energies statistical physics bayesian model comparison monte carlo chaining challenging problem number approaches developed neal discuss approach problem based chain successive models interpolate required evidence written ratios evaluated goal devise chain models successive pair models probability distributions close ratios evaluated accu keeping total number links chain fairly small limit computational costs chosen technique hybrid monte carlo neal sample distributions shown effective sampling complex distributions arising neural network models neal involves introducing hamiltonian equations motion parameters augmented momentum variables integrated method trajectory parameter vector accepted probability governed metropolis criterion replaced gibbs sampling check software implementation chaining evaluated evidence mixture gaussian distributions obtained result analytical solution application neural networks application chaining method regression problems involving neural network models network corresponds function data consists pairs input vectors targets assuming gaussian noise target data likelihood function takes form hyperparameter representing inverse noise variance networks single hidden layer tanh units linear output units neal diagonal gaussian prior weights divided groups inputtohidden weights biases hiddentooutput weights output biases group governed separate precision hyperparameter prior takes form normalization coefficient hyperparameters governed gamma distributions form barber bishop variance chosen give broad reflection limited prior knowledge values hyperparameters hybrid monte carlo algorithm sample joint distribution parameters 
hyperparameters evaluation evidence ratios parameter samples perform integrals hyper parameters analytically fact gamma distribution conjugate gaussian order apply chaining problem choose prior reference tribution define intermediate distributions based parameter governs effective contribution data term arises likelihood term corresponds prior select values interpolate reference distribution desired model distribution evidence prior easily evaluated analytically gaussian approximation comparison method chaining framework mackay based local gaussian approximation posterior distri bution approach makes evidence approximation inte hyperparameters approximated setting specific values determined maximizing evidence functions leads hierarchical treatment lowest level maximum posterior distribution weights found fixed values hyper parameters minimizing error function periodically hyperparameters evidence maximization evidence obtained analytically gaussian approximation reestimation formulae total number parameters group denotes trace parameters weights updated loop minimizing function conjugate gradient optimizer hyperparameters periodically training complete model evidence evaluated making gaussian approximation converged values hyperparameters distribution analytically model evidence assuming sufficiently broad effect location evidence maximum neglected bayesian model comparison monte carlo chaining number hidden units terms account equivalent modes posterior distribution arising hidden unit symmetries network model derivation results found bishop pages result corresponds single mode distribution initialize weight optimization algorithm random values find distinct solutions order compute evidence network model number hidden units make assumption found distinct modes posterior distribution precisely arrive total model evidence possibility solutions found related symmetry transformations account missed important modes attempt made detect degenerate solutions difficult 
framework gaussian approximation results robot problem illustration evaluation model evidence problem modelling forward kinematics robot twodimensional space introduced mackay problem chosen mackay reports good results gaussian approximation framework evaluate good opportunity comparison chaining approach task learn mapping data consists inputoutput pairs outputs corrupted gaussian noise standard deviation original training data mackay generated test points evidence evaluated chaining gaussian approximation networks numbers hidden units chaining method form gamma priors precision variables inputtohidden weights biases hiddentooutput weights output biases noise level hyperparameters settings follow closely neal problem hiddentooutput precision scaling chosen neal limit infinite number hidden units defined corresponds gaussian process prior evidence ratio chain samples hybrid monte carlo obtained trajectory length iterations omitted give algorithm chance reach equilibrium distribution samples obtained trajectory length evaluate evidence ratio figure show error values sampling stage hidden units errors largely uncorrelated required effective monte carlo sampling figure plot values note large change evidence ratios beginning chain sample close reference distribution barber bishop figure error plotted successive monte carlo samples values ratio reason choose dense close principled approaches partitioning selection figure shows model evidence number hidden units note chaining approach computationally expensive complete chain takes hours matlab implementation running silicon graphics challenge evidence number hidden units grows correspondingly figure test error performance degrade number hidden units increases overfitting increasing model complexity accordance bayesian expectations results gaussian approximation approach shown figure characteristic occam hill evidence shows peak strong decrease smaller values slower decrease larger values test errors similarly show 
minimum indicating gaussian approximation increasingly inaccurate complex models discussion chaining effective evaluation model neural networks monte carlo techniques find peak model evidence test error number hidden units increased indication fitting accord expectation model complexity limited size data marked contrast conventional figure plot numbers hidden units test error number hidden units theoretical minimum test error figure plot model evidence robot problem versus number hidden units gaussian approximation framework shows characteristic occam hill shape note evidence computed additive constant origin vertical axis significance plot test error versus number hidden units individual points correspond modes posterior weight distribution line shows test error maximum likelihood viewpoint consistent result limit infinite number hidden units prior network weights leads welldefined gaussian prior functions williams important advantage make accurate evaluations model evidence ability compare distinct kinds model radial basis function networks multilayer perceptrons chaining models back common reference model evaluating normalized model explicitly acknowledgement chris williams bruce number discussions work supported epsrc grant devel learning theory neural networks references bishop neural networks oxford university press hybrid monte carlo physics letters richardson markov chain monte carlo practice chapman hall bayes factors statist mackay practical bayesian framework backpropagation networks neural computation neal probabilistic inference markov chain monte carlo methods technical report department computer science university toronto neal bayesian learning neural networks springer lecture notes statistics williams computing infinite networks volume
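The chaining idea described above (writing the evidence as a product of ratios between neighbouring distributions that interpolate from the prior to the posterior) can be sketched on a one-dimensional toy problem. This is a minimal sketch under stated assumptions: the prior is a unit Gaussian, the error function is quadratic so that every intermediate distribution is Gaussian, and exact Gaussian draws stand in for the hybrid Monte Carlo sampler used in the paper. The function name `chained_log_evidence`, the values of `beta` and `target`, and the spacing of `lambdas` (dense near the reference end, as the text suggests) are hypothetical choices made for illustration.

```python
import math
import random

def chained_log_evidence(beta, target, lambdas, n_samples, rng):
    """Sum of log evidence ratios along a chain of interpolating distributions.

    p_lam(w) is proportional to N(w; 0, 1) * exp(-lam * E(w)) with
    E(w) = 0.5 * beta * (w - target)**2, so every p_lam is Gaussian.
    """
    log_z = 0.0  # the reference distribution (the prior) is normalized
    for lam, lam_next in zip(lambdas[:-1], lambdas[1:]):
        precision = 1.0 + lam * beta
        mean = lam * beta * target / precision
        std = 1.0 / math.sqrt(precision)
        total = 0.0
        for _ in range(n_samples):
            w = rng.gauss(mean, std)  # exact draw; stands in for hybrid Monte Carlo
            error = 0.5 * beta * (w - target) ** 2
            total += math.exp(-(lam_next - lam) * error)
        # Monte Carlo estimate of Z(lam_next) / Z(lam)
        log_z += math.log(total / n_samples)
    return log_z

beta, target = 25.0, 2.0
# lam = 0 is the prior, lam = 1 the full model; values are dense near 0.
lambdas = [0.0] + [10.0 ** (-3.0 + 3.0 * i / 19.0) for i in range(20)]
estimate = chained_log_evidence(beta, target, lambdas, 2000, random.Random(0))

# Closed form for this toy case, used only to check the estimate.
exact = -0.5 * math.log(1.0 + beta) - beta * target ** 2 / (2.0 * (1.0 + beta))
```

Because each link compares two nearby distributions, every individual ratio has low variance, whereas a single estimate drawn only from the prior would be dominated by a few samples that land in the small high-likelihood region.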
8 independent component analysis data scott makeig naval health research center diego scott anthony bell computational neurobiology salk institute diego jung naval health research center computational neurobiology salk institute diego terrence sejnowski howard hughes medical institute computational neurobiology salk institute diego abstract distance brain differ data collected point human scalp includes activity generated large brain area spatial data volume conduction involve significant time delays independent component analysis algorithm bell sejnowski suitable performing blind source data algorithm separates problem source identification source localization results applying algorithm eventrelated potential data collected sustained auditory detection task show training insensitive random seeds obvious components line muscle noise movements sources capable overlapping phenomena including theta bursts components separate channels behav state tracked amount residual correlation output channels makeig bell jung sejnowski introduction separating source analysis joint problems source segregation identification localization difficult problem determining brain electrical sources patterns recorded scalp surface mathematically efforts identify sources focused performing spatial segregation localization source activity applying algorithm bell sejnowski attempt completely separate problems source identification source localization algorithm derives independent sources highly correlated signals statistically regard physical location configuration source generators modeling unitary output multidimensional system independent microscopic generators suppose output number statistically independent spatially fixed systems restricted widely distributed independent component analysis independent component analysis techniques finding matrix vector elements linear transform random vector independent contrast decorrelation techniques principal components analysis ensure imposes 
stronger criterion multivariate probability density function finding factorization involves mutual information mutual information measure depends higherorder statistics decorrelation takes account statistics algorithm proposed carrying prior assump tion unknown independent components form cumulative density function scaling shifting form call performed maximizing entropy nonlinearly transformed vector yields stochastic gradient ascent rules adjusting elements shown solution stable point relaxation practical tests separating mixed speech signals good results found logistic function speech signals case algorithm simple form results obtained fact speech signals matched gradient logistic function experiments paper speedup technique independent component analysis data applying data technique appears ideally suited performing source separation sources independent propagation delays mixing medium negligible sources analog pdfs unlike gradient logistic sigmoid number independent signal sources number sensors meaning employ sensors algorithm separate sources case signals scalp electrodes pick correlated signals independent brain sources generated mixtures assume complexity dynamics modeled part collection modest number statistically independent brain processes source analysis problem satisfies assumption volume conduction brain tissue effectively instantaneous assumption satisfied assumption plausible assumption linear mixtures sources questionable effective number statistically independent brain signals contributing recorded scalp problem interpreting output determining proper dimension input channels physiological andor ical significance derived source channels model ignores variable synchronization separate generators common subcortical corticocortical influences appears promising identifying concurrent signal sources close widely distributed separated current tion techniques report application algorithm analysis recordings sustained performance auditory detection task 
give evidence suggesting algorithm identifying state transitions methods behavioral data collected develop method alertness operators complex systems adult sessions pushed detected auditory target stimulus level background noise maximize chance observing alertness sessions conducted small experimental chamber subjects instructed eyes closed auditory targets increases intensity white noise background threshold presented random time intervals rate superimposed continuous train steadystate response short probe tones frequencies target noise bursts intervals collected electrodes located sites international system referred sampling rate bipolar diagonal channel recorded movement artifact correction rejection hits defined targets responded window targets responded sessions subjects selected analysis based response continuous performance measure local error rate computed convolving performance index time series smoothing window advanced steps makeig bell jung sejnowski algorithm applied recordings time index permuted ensure signal stationarity time point vectors presented network time speed conver gence data remove secondorder statistics learning rate annealed convergence pass training checked amount correlation output channels amount change weight matrix stopped training procedure correlation channel pairs weights stopped changing results small portion resulting time series shown figure expected correlations traces close dominant theta wave spread channels left panel isolated trace upper epoch shown session alpha activity obvious data trace session alpha bursts quiescent periods traces oscillatory bursts easy characterize display dynamics activity trace dominates record trace slow movements frontal channels trace line noise traces broader high frequency spectrum suggesting source highfrequency activity generated scalp muscles apparently source solution data depend strongly learning rate initial conditions portion session train networks random starting weights data 
presentation orders learning rates final weight matrices close filtering segment data session matrix produced source transforms correlated output channel pairs correlated correlated training minimized mutual information correlations output channels initial alert training period output data channels filtered weight matrix correlated portion session initial levels decorrelation subject alert conversely filtering sessions data weight matrix trained portion sion produced output channels correlated alert portions session training period residual correlation outputs reflect dynamics topographic structure signals alert brain states important problem human determine means tively identifying overlapping figure panel shows decomposition left panel erps detected targets subject spatial filtering produces channels separating steadystate response produced continuous stimulation session note tion amplitude previously identified channels pass components detected target response independent component analysis data figure left seconds data transform data weights trained minutes similar data session makeig bell jung sejnowski scalp erps erps detected targets targets figure left panel eventrelated potentials erps response bold traces detected traces noise targets sessions panel signals filtered weight matrix trained data independent component analysis data components larger target response suggest represent time focal distributed brain response activity represent solution problem dividing evoked responses meaningful temporally overlapping components conclusions appears promising analysis tool human research wide range artifacts output channels removing remaining channels turn represent time activity transient independent brain sources algorithm reliably incorporating higherorder statistical information avoids decompositions algorithm appears decomposing evoked response data spatially measures source solution observing brain state acknowledgments report supported part grant naval 
health research center office naval research views expressed article authors reflect official policy position department department defense government bell supported grants office naval howard hughes medical institute references bell sejnowski approach blind separation blind deconvolution neural computation bell sejnowski fast blind separation based information theory proc symp nonlinear theory applications comon independent component analysis concept signal processing sereno source localization linear approach neurosci makeig dynamic steadystate poten dynamics sensory cognitive processing brain makeig eventrelated perturbations steadystate responses bullock brain dynamics progress perspectives jung makeig sejnowski estimating alertness power spectrum submitted publication makeig alertness coherence fluctuations performance spectrum clin
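The entropy-maximization rule with a logistic nonlinearity that the paper applies can be sketched on a two-channel toy mixture. Several things here are assumptions rather than the authors' procedure: the update is written in the natural-gradient form, W += lr * (I + (1 - 2y) u^T) W, as one common way to realize the speedup the text alludes to; the sources are synthetic Laplacian signals; and sphering, learning-rate annealing, and time-index permutation are omitted. The names `infomax_ica`, `laplace`, `A`, and `crosstalk` are hypothetical.

```python
import math
import random

def infomax_ica(xs, lr=0.2, n_iter=150):
    """Batch infomax updates for two-channel data; returns the unmixing matrix W."""
    w00, w01, w10, w11 = 1.0, 0.0, 0.0, 1.0
    n = len(xs)
    for _ in range(n_iter):
        g00 = g01 = g10 = g11 = 0.0
        for x0, x1 in xs:
            u0 = w00 * x0 + w01 * x1
            u1 = w10 * x0 + w11 * x1
            # 1 - 2 * logistic(u) equals -tanh(u / 2)
            f0 = -math.tanh(0.5 * u0)
            f1 = -math.tanh(0.5 * u1)
            g00 += f0 * u0
            g01 += f0 * u1
            g10 += f1 * u0
            g11 += f1 * u1
        # M = I + average of (1 - 2y) u^T over the batch
        m00, m01 = 1.0 + g00 / n, g01 / n
        m10, m11 = g10 / n, 1.0 + g11 / n
        # natural-gradient step: W <- W + lr * M W
        d00 = m00 * w00 + m01 * w10
        d01 = m00 * w01 + m01 * w11
        d10 = m10 * w00 + m11 * w10
        d11 = m10 * w01 + m11 * w11
        w00, w01 = w00 + lr * d00, w01 + lr * d01
        w10, w11 = w10 + lr * d10, w11 + lr * d11
    return [[w00, w01], [w10, w11]]

def laplace(rng):
    """Unit-variance Laplacian sample (a super-Gaussian source)."""
    return rng.choice((-1.0, 1.0)) * rng.expovariate(math.sqrt(2.0))

rng = random.Random(0)
A = [[1.0, 0.6], [0.4, 1.0]]  # hypothetical instantaneous mixing matrix
sources = [(laplace(rng), laplace(rng)) for _ in range(2000)]
mixed = [(A[0][0] * s0 + A[0][1] * s1, A[1][0] * s0 + A[1][1] * s1)
         for s0, s1 in sources]

W = infomax_ica(mixed)
# C = W A should be close to a scaled permutation if separation succeeded.
C = [[sum(W[i][k] * A[k][j] for k in range(2)) for j in range(2)] for i in range(2)]
crosstalk = max(min(abs(r[0]), abs(r[1])) / max(abs(r[0]), abs(r[1])) for r in C)
```

The `crosstalk` value measures how much of the weaker source leaks into each output channel. It is only computable here because the toy mixing matrix is known, which is not the case for scalp recordings.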
5 diffusion approximations constant learning rate backpropagation algorithm local minima william finnoff siemens corporate research development munich germany abstract paper discuss asymptotic properties variant backpropagation algorithm work weights trained means local gradient descent amples drawn randomly fixed training learning rate gradient updates held constant simple gation stochastic approximation results show training process approaches batch training results rate convergence show small approximate simple back propagation batch training process gaussian diffusion unique solution linear stochastic differential equation approximation reasons simple backprop agation stuck local minima batch training process demonstrate empirically number examples introduction original simple backpropagation algorithm incorporating pattern pattern learning constant learning rate remains spite real finnoff widely network training algorithm vast body literature documents general applicability robustness paper draw highly developed literature stochastic approximation demonstrate asymptotic properties simple backpropagation close relationship backpropagation stochastic approximation methods long recognized properties algorithm case decreasing learning rate shown white darken moody comparable results algorithm constant learning rate derive weak convergence results part paper show simple backpropagation asymptotic dynamics batch training small learning rate limit expected batch training expected simple backpropagation long learning rate algorithm small special situation considered contrast provide result speed convergence expected batch training expected simple backpropagation long learning rate algorithm small part paper gaussian approximations difference actual training process limit derived shown difference properly solution linear stochastic differential equation final section paper combine results provide approximation simple backpropagation training process show simple 
backpropagation stuck local minima batch training ability avoid local minima demonstrated empirically examples notation define parametric version single hidden layer network activation function inputs outputs hidden units setting transpose denotes number weights network training exam ples consisting targets inputs define parametric error function gradient diffusion approximations constant learning rate backpropagation algorithm approximation asymptotic properties network training processes induced starting gradient direction function learning rate rand infinite training sequence drawn random defines discrete parameter process weight updates setting continuous parameter process setting question investigate small learning rate limit continuous parameter process limiting properties family show family stochas processes converges probability limit process denotes solution gradient equation constant solution deterministic result corresponds large numbers weight update process small learning rate limit averages stochastic fluctuations central application stochastic approximation results derivation subject local lipschitz linear growth bounds exists constant exists constant finnoff proof calculations result based ward making repeated fact products sums locally lipschitz continuous functions locally lipschitz continuous provide explicit values constants denoting resp probability resp mathematical expectation processes defined results probability deviations process limit exists constant doesnt depend proof part proof requires finds bounds accomplished results lemma places bounds remainder proof part required conditions follow directly hypotheses variables condition trivially fulfilled noted constant dependent increase exponentially show training process remains bounded region necessarily exclusively difference stochastic approximation discrete parameter gradient process error discrete approximation continuous parameter versions gaussian approximations section give gaussian 
approximation difference training process limit limit coincide training process limit stochastic fashion gaussian approximation estimate size nature diffusion approximations constant learning rate backpropagation algorithm fluctuations depending order statistics matrix weight update process define resp denote coordinate vector resp define represents covariance matrix random elements define symmetric matrix valued matrix property result represents central limit theorem training process permits type order approximation fluctuations stochastic training process deterministic limit assumptions distributions processes converge weakly sense weak convergence measures uniquely defined measure denotes solution stochastic differential equation denotes standard ddimensional brownian motion covariance matrix equal identity matrix proof proof part noted proof hypotheses conditions fulfilled define hypotheses continuous order derivatives function fulfill remaining requirements trivial consequence definition finally setting derived directly definitions local minima section combine results preceding sections provide gaussian approximation simple backpropagation recalling results finnoff notation approximation small learning rate simple backpropagation batch learning produce essentially results stochastic portion process controlled negligible stochastic element training process approximated gaussian diffusion diffusion term simple backpropagation character gradient continuously perturbed gaussian term allowing escape local minima small shallow basins attraction noted rest term convergence rate calculation exact rates require generalized version theorem knowledge results applicable situation empirical results simple backpropagation local minima part neural network examples single hidden layer feedforward network tanh hidden units output trained simple backpropagation batch training data gener ated models data consisted pairs targets inputs experiment based additive structure form 
product structure structure considered constructed sums radial basis functions points chosen independent uniform distribution final experiment conducted data generated feedforward network activation function details construction examples model training runs made vector starting weights simple backpropagation batch training batch training process stuck local minimum producing worse results found simple backpropagation wide array structures generate data number data sets hard observed phenomena dependent (figure: error versus epochs, simple backpropagation and batch training; panels: product mapping, sums of rbfs) references adaptive algorithms stochastic springer-verlag approximation thesis paris french darken moody note learning rate schedules stochastic optimization neural information processing systems mann moody touretzky morgan kaufmann mateo improving model selection methods neural networks hornik convergence learning algorithms constant learning rates ieee neural networks white asymptotic results learning single hidden-layer feedforward network models jour amer star white learning artificial neural networks statistical perspective neural computation
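The small learning rate result discussed in this paper (simple backpropagation with a constant learning rate tracking batch training) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and not the paper's setup: a one-weight linear model, a toy data set with slope 2, and the names `make_data`, `grad_one`, `batch_training`, `simple_backprop`, and `final_gap`. Both processes are run for the same rescaled time (number of steps times learning rate) and their final weights are compared.

```python
import random


def make_data(n, rng):
    # Hypothetical toy training set: targets follow y = 2*x plus small noise.
    xs = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    ys = [2.0 * x + rng.gauss(0.0, 0.1) for x in xs]
    return xs, ys


def grad_one(w, x, y):
    # Gradient of the single-example error 0.5 * (y - w*x)**2 with respect to w.
    return -(y - w * x) * x


def batch_training(w, xs, ys, eta, steps):
    # Each update uses the gradient averaged over the whole training set.
    for _ in range(steps):
        g = sum(grad_one(w, x, y) for x, y in zip(xs, ys)) / len(xs)
        w -= eta * g
    return w


def simple_backprop(w, xs, ys, eta, steps, rng):
    # Each update uses one example drawn at random, with a constant learning rate.
    for _ in range(steps):
        i = rng.randrange(len(xs))
        w -= eta * grad_one(w, xs[i], ys[i])
    return w


def final_gap(eta, total_time=10.0, seed=0):
    # Run both processes for the same rescaled time (steps * eta) and compare.
    rng = random.Random(seed)
    xs, ys = make_data(50, rng)
    steps = round(total_time / eta)
    w_batch = batch_training(0.0, xs, ys, eta, steps)
    w_simple = simple_backprop(0.0, xs, ys, eta, steps, rng)
    return abs(w_batch - w_simple)


if __name__ == "__main__":
    for eta in (0.1, 0.01):
        print(eta, final_gap(eta))
```

On this convex toy problem the two final weights should stay close for small learning rates. The sketch does not reproduce the local minima experiments, which need a nonconvex error surface.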
11 orientation scale discontinuity emergent properties illusory contour shape research institute independence princeton williams dept computer science university mexico albuquerque abstract recent neural model illusory contour formation based distribution natural shapes traced particles moving constant speed directions brownian motions input model consists pairs position direction constraints output consists distribution contours joining pairs general contours closed distribution paper show compute distribution closed contours position constraints result explain illusory contour effect introduction proposed distribution illusory contour shapes modeled particles travelling constant speed directions brownian motions recently williams introduced notion stochastic completion field distribution particle trajectories joining pairs position direction constraints showed computed local parallel network argued mode magnitude variance completion field related observed shape salience illusory contours williams jacobs model ings recent psychophysics suggests contour salience greatly enhanced general distribution computed williams jacobs model consist closed contours distances constraints produce comparable completion field double size doubling particles speeds williams jacobs model intrinsic mechanism speed selec tion speeds directions priori paper show compute distribution closed contours position constraints technical details shape distribution consistent earlier paper tribution assume distribution completion shapes consisting modified random impulses drawn mixture limiting distributions distribution consists weak frequently acting impulses call tribution weak impulses variance equal weak impulses poisson times rate distribution consists strong acting impulses call magnitude random impulses gaussian distributed variance equal strong impulses poisson times rate particles decay equal param eter effect particles tend travel smooth short paths occasional orientation 
discontinuities position velocity constraints conditional probability particle beginning reach note transition probabilities symmetric symmetry matrix transition probabilities compute relative number closed contours satisfying position velocity constraint begin noting randomness increasingly smaller smaller contours satisfy increasing numbers constraints contours start suppose relative number contours satisfy straints general suppose compute eigenvector largest real positive eigenvalue implies number constraints satisfied increases number contours remaining sample interest decreases ratios remain invariant letting pass infinity relative number contours summarize started contours left pairs constraints solving relative numbers refer components stochastic completion field emergent properties illusory contour shape stochastic completion fields note represent distribution closed contours fact majority contours contributing satisfy single additional constraint recurrence equation number contours begin constraint constraint satisfy intermediate constraints recurrence equation define expression relative number contours length begin constraint result theory positive show expression simply left eigenvectors largest positive real eigenvalue symmetry left eigenvectors related permutation opposite directions finally compute relative number closed contours arbitrary position velocity plane compute stochastic completion field arbitrary position velocity plane relative probability closed contour pass note natural generalization williams factorization completion field product source fields restriction particles constant speed transition probability matrix block corresponds speed components eigenvector confined single block function solve largest positive real eigenvalue speed maximized eigenvector limiting distribution spatial scales experiments point circle points spaced uniformly circle diameter find distribution directions point completion field figure left order directions 
speed priori experiments sample direction intervals discrete directions pairs size parameters defining distribution completion shapes simplicity assume pure case point sizes figure left position constraints order direc tions speed priori eigenvector represents distribution spatial scales product orientations tangent circle dominate distribution closed contours stochastic completion field plot magnitude maximum positive real eigenvalue point circle solid dashed figure observers report width arms increases shape illusory contour circle evaluated velocity interval standard numerical routines plotted magnitude largest real positive eigenvalue amax function reaches maximum eigenvector represents limiting distribution spatial scales figure scaled test figure factor plotted interval figure observe plotted logarithmic xaxis functions identical translation confirms size figure results doubling selected speed koffka cross cross stimulus figure basic degrees freedom call diameter width figure interested emergent properties illusory contour shape figure koffka cross showing diameter width orientation position constraints terms normal orientation endpoint solid lines dashed lines represent minus standard deviation gaussian weighting function typically perceived square typically perceived circle positions line endpoints stochastic completion field parameters varied observers report width arms increases shape illusory contour circle endpoints lines comprising koffka cross define position orientation constraints figure position constraints terms parameters orientation constraints form gaussian weighting function assigns higher probabilities contours passing endpoints orientations normal lines prior probabilities assigned position direction pair gaussian weighting function form diagonal matrix transition probability matrix random process scale eigenvalue eigenvector largest positive real eigenvalue scale maximized eigenvector limiting distribution spatial scales koffka cross 
evaluated velocity interval standard numerical routines function reaches maximum figure left observe completion field eigenvector dominated contours predominantly circular shape figure uniformly scaled koffka cross figure factor figure perceived square figure perceived circle positions line endpoints orientations lines affect percept chosen model dependence gaussian weighting function favors contours passing endpoints lines normal direction motivate based statistics natural scenes distribution relative orientations contour crossings maximum drops parameters defining distribution completion shapes measure transition probabilities averaged initial conditions modeled gaussians variance williams crosses sizes figure left plot magnitude maximum positive real eigenvalue koffka crosses solid dashed completion field eigenvector plotted interval figure left observe confirms scale invariance system studied relative magnitudes local maxima change parameter varied begin koffka cross observe local maxima figure left refer larger maxima circle previously noted imum located approximately maximum located approximately completion field eigenvector rendered observe distribution dominated contours predominantly square shape figure reason refer local maximum square cross widths arms doubled diameter remains observe local maxima approxi approximately figure left render completion fields eigenvectors find completion fields general char contours smaller spatial scale lower speed approximately circular larger spatial scale higher speed approximately square figure refer locations respective local maxima circle square interesting relative magnitudes local maxima reversed previously observed circle observe circle completion field eigenvector represents limiting distribution spatial scales consistent transition circle square reported human observers widths arms koffka cross increased emergent properties illusory contour shape crosses widths figure plot magnitude maximum positive real 
eigenvalue amax koffka crosses solid dashed stochastic completion fields koffka cross local optimum global optimum global optimum local optimum results consistent transition perceived human subjects width arms koffka cross increased conclusion improved previous model illusory contour formation show compute distribution closed contours position constraints model explain previously perceptual effect references horn johnson matrix analysis cambridge univ press closed curve incomplete effect closure figure-ground segmentation proc natl acad mumford computer vision algebraic geometry applications springer-verlag york angular margins gradients journal psychology williams analytic solution stochastic completion fields biological cybernetics williams characterizing distribution completion shapes corners mixture random processes intl workshop energy minimization methods computer vision italy williams jacobs stochastic completion fields neural model illusory contour shape salience neural computation williams jacobs local parallel computation stochastic fields neural computation
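The central computation in this paper, finding the eigenvector associated with the largest positive real eigenvalue of a transition probability matrix, can be sketched with plain power iteration. This is only an illustration under stated assumptions: the 2 by 2 matrix below is hypothetical and is not a transition matrix from the paper, and the helper names `mat_vec`, `norm`, and `dominant_eigenpair` are invented here. Power iteration mirrors the argument that the ratios between contour counts become invariant as the number of satisfied constraints grows.

```python
def mat_vec(P, v):
    # Multiply the square matrix P (list of rows) by the vector v.
    return [sum(p * x for p, x in zip(row, v)) for row in P]


def norm(v):
    # Euclidean length of v.
    return sum(x * x for x in v) ** 0.5


def dominant_eigenpair(P, iterations=200):
    # Power iteration: repeatedly apply P and renormalize. For a matrix with
    # positive entries this converges to the eigenvector of the largest
    # positive real eigenvalue.
    n = len(P)
    v = [1.0 / n ** 0.5] * n  # positive unit-length starting vector
    lam = 0.0
    for _ in range(iterations):
        w = mat_vec(P, v)
        lam = sum(a * b for a, b in zip(v, w))  # Rayleigh quotient (v has unit length)
        size = norm(w)
        v = [x / size for x in w]
    return lam, v


if __name__ == "__main__":
    # Hypothetical positive matrix with eigenvalues 0.6 and 0.3.
    P = [[0.5, 0.2],
         [0.1, 0.4]]
    lam, v = dominant_eigenpair(P)
    print(lam, v)
```

To mimic the speed selection described in the text, one could evaluate `dominant_eigenpair` for a family of matrices indexed by speed and keep the speed with the largest eigenvalue. That scan is omitted because the matrices themselves depend on the model parameters.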
10 error coding substitution james trevor hastie department statistics stanford university abstract class classification techniques recently statistics machine learning literature clas siftcation technique pact method takes standard classifier trees algorithm produce classifier standard classifier classi methods produce large improvements single classifier paper investigate methods give motivation success introduction dietterich suggested method motivated error correct coding theory solving class classification problems binary classifiers produce large binary coding matrix matrix zeros denote matrix component column column coding matrix create super groups assigning groups element super group groups super group train classifier class problem repeat process columns produce trained classifiers test point apply classifiers classifier produce estimated probability test point super group produce vector probability estimates error coding substitution classify point calculate groups distance classify group lowest distance equivalently rain call ecoc pact coding matrix corresponds unique minimal coding class motivation allowed errors individual classifiers corrected small number classifiers gave influence final classification tested results obtained trees experiments paper stated standard cart note theorems general past assumed improvements shown method error coding structure effort devoted choosing optimal coding matrix paper develop results suggest randomized coding matrix match exceed performance designed matrix coding matrix empirical results dietterich suggest ecoc pact duce large improvements standard class tree classifier shed light case answer question explore probability structure coding matrix central pact past usual approach choose large separation rows terms hamming distance basis largest number errors corrected sections examine tradeoffs designed deterministic completely randomized matrix results follow make assumption posterior probability test observation 
group predictor variable assumption states average classifier estimate probability super group correctly assumption trees considered bias deterministic coding matrix notice classify identical ecoc pact theorem section explains intuitive transformation pact outperform bayes classifier hope achieve bayes error rate bayes classifier class problem defined property bayes optimality bayes optimality essentially consistency states converges bayes classifier training sample size increases pact definition pact bayes optimal test classifies bayes group bayes classifier ecoc pact means points predictor space bayes classifier shown case james hastie clear expression guarantee fact theorem tells restricted circumstances ecoc pact bayes optimal theorem error coding method bayes optimal hamming distance pair rows coding matrix equal hamming distance binary vectors number points differ general generate matrix property ecoc pact bayes optimal random coding matrix previous section potential problems deter matrix suppose randomly generate coding matrix choosing equal probability coordinate training conditional expectation prove theorem theorem random coding matrix conditional words classification ecoc pact approaches classification leads corollary eliminated main concern deter matrix corollary coding matrix randomly chosen ecoc pact cally bayes optimal theorem consequence strong theorems provide motivation ecoc procedure theorem assumption randomly generated coding matrix tells unbiased estimate conditional probability classifying maximum sense unbiased estimate bayes classification theorem tells large ecoc pact similar classifying large depends rate convergence theorem tells rate fact exponential theorem randomly choose conditional fixed constant note theorem depend assumption tells error rate ecoc pact equal error rate term decreases exponentially limit result proved inequality upper bound error rate necessarily behavior smaller values conditions taylor expansion small values error 
coding substitution convergence figure curves rates expect smaller values error rate decreases power increases change exponential test hypothesis calculated error rates values data irvine repository machine learning points random point average random training sets figure illustrates results curves lower curve groups fits groups predicts errors groups upper curve groups fits groups predicts errors groups supports hypothesis error rate moving powers exponential figure values reduction error rate substantially remaining errors result error rate reduce changing coding matrix coding matrix viewed method sampling distribution sample randomly optimal estimate parameter random sampling improve designing coding matrix improve training data influence sampling procedure estimating quantity designed coding matrix training data improve random sampling procedure attempted past ecoc pact work motivate ecoc pact works case tree classifiers similar method call substitution pact show conditions ecoc pact similar substitution pact motivate success hastie substitution pact substitution pact coding matrix form trees ecoc pact transformed training data form probability estimate class problem original training data back tree training data form probability estimates classifications regular tree difference tree formed unlike ecoc pact tree produce probability estimate classes class simply average probability estimate class trees probability estimate substitution pact probability estimate group tree formed column coding matrix theorem shows conditions ecoc pact thought approximation substitution pact theorem suppose independent column coding matrix approaches infinity ecoc pact substitution pact converge give identical classification rules theorem depends unrealistic assumption empirically trees unstable small change data large change structure tree reasonable suppose correlation test empirically ecoc substitution simulated data data composed classes class distributed normal identity 
covariance matrix uniformly distributed means training data consisted observations group figure shows plot estimated probabilities classes test data points averaged training data sets points true posterior probability greater plotted groups insignificant probabilities affect classification groups producing identical estimates expect data points dotted case substitution pact systematically shrinking probability estimates clear linear relationship interested test point expect similar classifications fact fewer points correctly classified group substitution pact work fact average probability estimates suggests reduction vari ability explanation success substitution pact shown friedman reduction variance probability estimates necessarily correspond reduction error rate theorem simplifying assumptions relationship quantities exists theorem suppose error coding substitution ecoc figure probability estimates ecoc substitution identical joint distributions variance probability estimate group class tree method constants true posterior probability assumed constant theorem states fairly general conditions probability substitution pact classification bayes classifier great tree method provided standardized variability noted case groups direct correspondence error rate inequality strict common distributions normal uniform exponential gamma reason general small result empirical variability tree classifiers small change training large change structure tree final probability estimates changing super group coding expect probability estimate fairly unrelated previous estimates correlation test accuracy theory examined results simulation performed section estimate table summarizes estimates variance terms simulated classifi tree method james hastie tree ecoc substitution scale figure error rates simulated data tree method substitution pact ecoc pact plotted scale quantities give estimate derived estimate provided improvement substitution pact class tree classifier figure shows 
substitution error rate drops tree classifier point conclusion ecoc pact originally error coding ideas classification problems results error coding matrix simply method randomly sampling fixed distribution idea similar randomly sample empirical distribution fixed data estimate variability parameter estimate sources error randomness caused sampling empirical distribution randomness data case sources error error caused sampling estimate errors caused cases sort error reduce rapidly type interested motivate reduction error rate terms decrease variability provided large correlation small references dietterich solving multiclass learning problems error correcting output codes journal artificial intelligence research dietterich kong error-correcting output coding bias variance proceedings international conference machine learning morgan kaufmann friedman bias variance curse dimensionality dept statistics stanford university technical report hoeffding probability inequalities sums bounded random variables journal american statistical association march
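The ECOC procedure with a randomly generated coding matrix can be sketched as follows. This is a hedged illustration and not the authors' implementation: the function names are invented, the distance is assumed to be the sum of absolute differences between the probability vector and each row of the coding matrix, and the trained binary classifiers are replaced by an idealized stand-in that computes each super group probability directly from a known class posterior.

```python
import random


def random_coding_matrix(num_classes, num_columns, rng):
    # Each entry is 0 or 1 with equal probability (the randomized design).
    return [[rng.randint(0, 1) for _ in range(num_columns)]
            for _ in range(num_classes)]


def super_group_probs(Z, posterior):
    # Idealized stand-in for the trained classifiers: the probability that a
    # point belongs to the super group of column j is the total posterior
    # mass of the classes coded 1 in that column.
    num_columns = len(Z[0])
    return [sum(Z[i][j] * posterior[i] for i in range(len(Z)))
            for j in range(num_columns)]


def ecoc_classify(Z, p_hat):
    # Assign the class whose row of the coding matrix has the lowest
    # L1 distance to the vector of super group probability estimates.
    distances = [sum(abs(p - z) for p, z in zip(p_hat, row)) for row in Z]
    return min(range(len(Z)), key=lambda i: distances[i])


if __name__ == "__main__":
    rng = random.Random(1)
    Z = random_coding_matrix(3, 200, rng)
    posterior = [0.1, 0.7, 0.2]  # hypothetical class posterior at one test point
    print(ecoc_classify(Z, super_group_probs(Z, posterior)))
```

With many random columns the decoded class should agree with the class of largest posterior, which is the behaviour the random coding matrix theorems describe. With few columns the two can disagree.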
12 learning sparse codes prior olshausen department psychology center neuroscience davis newton davis center neuroscience davis newton davis abstract describe method learning overcomplete basis functions purpose modeling sparse structure images sparsity basis function coefficients modeled distribution gaussian captures active coefficients distribution centered gaussians capture active coefficients distribution show prior form exist efficient methods learning basis functions parameters prior performance algorithm demonstrated number test cases natural images basis functions learned natural images similar obtained methods sparse form coefficient distribution parameters prior adapted data assumption sparse structure images made priori learned data introduction general problem address learning basis functions representing natural images efficiently previous work variety opti mization schemes established basis functions code natural images terms sparse independent components resemble wavelet basis basis functions spatially localized oriented bandpass order tile joint space position orienta tion manner yields image representations basis overcomplete basis functions exceeds dimensionality images coded major challenge learning overcomplete bases fact posterior distribution coefficients sampled learning posterior sharply peaked sparse prior imposed conventional sampling methods cumbersome olshausen approach dealing problems overcomplete codes sparse priors suggested form resulting posterior distribution coefficients averaged images shown posterior distribution coefficients overcomplete representation sparse prior imposed learning cauchy distribution dashed line coefficients imposed prior occupy states inactive state coefficient active state coefficient takes significant nonzero continuum suggests choice prior capable capturing discrete states figure posterior distribution coefficients cauchy prior approach modeling form sparse structure prior coefficients binary state 
variables determine coefficient active inactive state coefficient distribution gaussian distributed variance depends state variable important advantage approach regard sampling problems mentioned gaussian distributions analytical solution integrating posterior distribution setting state variables sampling binary state variables show problem tractable approach differs previously attias variational methods approximate posterior rely sampling adequately characterize posterior distribution coefficients model image modeled linear superposition basis functions coefficients gaussian noise expressed notation prior probability distribution coefficients factoffal distri bution coefficient modeled distribution gaussians binary state variables determine gaussian describe coefficients total prior sets variables form learning sparse codes prior gaussians binary state variables gaussians state variables figure prior determines probability active inactive states gaussian distribution variance determined current state total image probability parameters include diagonal covariance matrix elements notations explicitly reflect dependence means variances diagonal elements model illustrated graphically figure figure image model olshausen learning objective function learning parameters model average loglikelihood maximizing objective minimize lower bound coding length learning accomplished gradient ascent objective learning rules parameters takes values binary defined section note expressions dropped outer brackets averaging images simply reduce clutter image sample posterior order collect statistics needed learning statistics accumulated images parameters updated rules note approach differs attempt states variational approximation approximate posterior effectively summing states probable posterior conjecture scheme work practice posterior significant probability small fraction states small number samples present efficient method gibbs sampling posterior sampling inference order sample 
posterior cast boltzmann form learning sparse codes prior const argmin performed state variables accord binary binary case alternative states case denotes change changing note computations change state local involve terms index deciding change state computed quickly change state accepted update formula computation long accepted state rare found case sparse distributions gibbs sampling performed quickly efficiently addition generally sparse matrices system scaled number elements affected flip order code images model single state coefficients chosen image purpose estimator maximizing posterior distribution accomplished assigning gradually lowering state olshausen results test cases trained algorithm number test cases forms sparse bimodal structure critically sampled complete overcomplete basis sets training sets consisted pixel image patches created sparse superposition basis functions results test cases confirm algorithm capable correctly extracting sparse structure data shown lack space natural images trained algorithm image patches extracted natural images cases basis functions initialized random functions white noise prior initialized gaussian gaussians roughly equal variance shown figure results basis functions overcomplete case case prior initialized gaussians equal variance offset positions case sparse form prior emerged completely data resulting priors coefficients shown figure posterior distribution averaged images coefficients posterior distribution matches prior tails laplacian form appears extra complexity offered gaussians utilized gaussians move center position bimodal prior imposed basis function solution localized oriented bandpass sparse priors coding efficiency evaluated coding efficiency coefficients levels calculating total coefficient entropy function distortion intro duced quantization basis sets basis functions high overcomplete basis sets yield coding efficiency fact coefficients code point occurs appears point errors longer perceptually 
noticeable conclusions shown prior basis functions image model adapted natural images sparseness imposed model seeks distributions sparse learns basis functions distribution conjecture small number samples posterior sufficiently characterized appears hold cases averages collected gibbs sweeps sweeps initialization algorithm proved capable extracting structure challenging datasets high dimensional spaces overcomplete image codes lowest coding cost high levels levels higher practically hand learning sparse codes prior figure overcomplete basis functions priors vertical axis learned natural images priors learned mixture basis functions posterior distribution averaged coefficients rate distortion curve comparing coding efficiency learned basis sets marginal entropies true entropy coefficients considerably statistical dependencies coefficients case overcomplete bases show lower dependencies included model coupling term acknowledgments work supported grant references olshausen field sparse coding overcomplete basis strategy employed vision research bell sejnowski independent components natural images edge filters vision research independent component filters natural images compared simple cells primary visual cortex proc royal lond lewicki olshausen probabilistic framework adaptation comparison image codes simoncelli freeman adelson heeger multiscale transforms ieee transactions information theory attias independent factor analysis neural computation
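The coefficient prior described in this paper, a mixture of Gaussians selected by a binary state variable, can be sketched for a single coefficient. The code is a minimal illustration under assumptions that are not taken from the paper: both Gaussians have zero mean, the inactive state has the small variance and the active state the large one, and the parameter values and function names are hypothetical.

```python
import math


def gaussian_pdf(a, variance):
    # Zero-mean Gaussian density evaluated at a.
    return math.exp(-0.5 * a * a / variance) / math.sqrt(2.0 * math.pi * variance)


def mixture_prior(a, p_active, var_inactive, var_active):
    # Prior on one coefficient: the binary state picks which Gaussian applies.
    return ((1.0 - p_active) * gaussian_pdf(a, var_inactive)
            + p_active * gaussian_pdf(a, var_active))


def prob_active_given_coefficient(a, p_active, var_inactive, var_active):
    # Posterior probability that the state is active, given the coefficient value.
    active = p_active * gaussian_pdf(a, var_active)
    return active / mixture_prior(a, p_active, var_inactive, var_active)


def sample_coefficient(rng, p_active, var_inactive, var_active):
    # Draw the binary state first, then the coefficient from the chosen Gaussian.
    # rng is expected to be a random.Random instance.
    state = 1 if rng.random() < p_active else 0
    variance = var_active if state else var_inactive
    return state, rng.gauss(0.0, math.sqrt(variance))


if __name__ == "__main__":
    # Hypothetical parameters: rarely active, narrow inactive Gaussian.
    print(prob_active_given_coefficient(0.0, 0.1, 0.01, 1.0))
    print(prob_active_given_coefficient(1.0, 0.1, 0.01, 1.0))
```

Because the prior is Gaussian once the states are fixed, only the binary states need to be sampled. The full Gibbs sampler over all states given an image is not shown here, since it depends on the basis functions and noise model.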
11 linear hinge loss average margin italy manfred warmuth computer science department university california santa cruz abstract describe unifying method proving relative loss bounds line linear threshold classification algorithms perceptron winnow algorithms classification problems discrete loss total number prediction mistakes introduce tinuous loss function called linear hinge loss employed derive updates algorithms prove bounds linear hinge loss convert discrete loss intro duce notion average margin examples show relative loss bounds based linear hinge loss converted relative loss bounds discrete loss average margin introduction classical perceptron algorithm hypothesis algorithm trial linear threshold function determined weight vector instance linear activation passed threshold function arguments threshold prediction algorithm binary denote classes perceptron algorithm aimed learning classification problem examples form examples algorithm predicts instance algorithms prediction agrees label instance loss prediction label disagree loss call loss discrete loss convergence perceptron algorithm established perceptron convergence theorem classical algorithm learning linear threshold functions winnow algorithm littlestone algorithm maintains weight vector predicts linear threshold function defined current weight vector update weight vector supported grant performed algorithms perceptron winnow perceptron algorithm performs simple additive update parameter positive learning rate equals lies diction algorithm correct update occurs perceptron algorithm winnow update update prediction algorithm wrong algorithm perceptron subtract current weight similarly algorithm perceptron adds current weight interpret gradient loss function winnow gradient update logarithm weight vector rewrite update gradient appears exponents factors multiply weights factors correct weights direction algorithm algorithms good purposes generally speaking discussion framework introduced deriving 
simple online learning updates framework applied variety learning algorithms differentiable loss functions updates derived approximately solving minimization problem loss denotes chosen loss function setting discrete loss prediction algorithm discrete loss discontinuous weight vector return point parts minimization problem parameter learning rate mentioned importantly divergence measuring divergence function purposes motivates update potential function analysis prove loss bounds algorithm analysis context learning essentially back method deriving updates based divergence introduced divergence regularization term serve barrier func tion optimization problem purpose keeping weights region additive algorithms gradient descent perceptron algorithm divergence potential function proof perceptron convergence theorem multiplicative update algorithms winnow exponentiated gradient algorithms potential functions function minimized works loss function convex differentiable linear regression loss function square loss minimizing divergence update exponentiated gradient algorithms derived entropic case differentiate discrete loss discontinuous asked loss function motivates perceptron winnow algorithms framework loss function achieves continuous linear hinge loss average margin gradient call loss linear hinge loss tool understanding linear threshold algorithms perceptron winnow process changing discrete loss changed learning problem classification regression problem versions algorithm classification version regression version classification version predicts binary label linearly thresholded prediction loss function discrete loss regression version hand predicts instance linear activation classification problem labels examples regression problem labels versions algorithm rule update weight vector strong hint related perceptron winnow fact loss limiting case entropic loss logistic regression logistic regression threshold function replaced smooth tanh function technical associating 
matching loss function increasing transfer function matching loss tanh transfer function loss show making transfer function taking viewpoint matching loss entropic loss converges limiting case slope transfer function infinite threshold function question introduction prove unifying class general additive algorithms defined bounds regression versions perceptron winnow simple special cases loss bounds converted loss bounds classification problems discrete loss conversion carried working average margin examples relative linear threshold classifier conversion paper considered principled deriving average mistake bounds average margin reveals structure mistake bound results proven conservative online algorithms previously definitions deviation attribute error easily related average margin terms average margin linear hinge loss define subsets weight domain instance domain weights maintained algorithms weight domain instances examples instance domain require convex general additive algorithm divergence defined terms link function function vector valued function interior weight domain property jacobian strictly positive definite link function unique inverse assume gradient potential function easy extend domain includes boundary link function bregman divergence function defined difference order taylor expansion strictly positive definite jacobian potential strictly convex equality holding perceptron algorithm motivated identity link weight domain divergence winnow warmuth figure matching loss figure function cases weight domain link function logarithm divergence related link function unnormalized relative entropy note property immediately definition divergence lemma paper focus single neuron hard threshold transfer function beginning introduction view neuron ways standard view neuron binary classification outputs predict desired label threshold view neuron outputs linear activation predict classification discrete loss regression linear hinge loss parameterized threshold note 
arguments losses switched discussed easily shown convex gradient loss note values mentioned introduction strictly speaking gradient defined equals threshold show subsequent sections case properties figure graphical representation threshold function transfers linear activation prediction hard classification remaining discussion section assume loss generality threshold smooth transfer functions tanh commonly neural networks tanh relative loss bounds proven comparison class consists single neurons increasing differentiable transfer function work loss function matches transfer function loss defined figure matching loss square loss linear regression matching loss entropic loss logistic regression defined notation matching loss subscript stress connection matching loss divergence discussed section entropic loss finite tanh ranges needed logistic regression type loss classification threshold functions slope function increased limit threshold slope matching loss infinite slopes relative loss bounds based notion matching loss grow slope function impossible matching loss function threshold make sense matching loss viewing neuron matching loss rewritten bregman divergence function increase slope function tanh keeping fixed limiting case threshold loss hinge loss threshold finally observe views neuron related property bregman algorithms paper associate general additive algorithms link function classification algorithm regression algorithm algorithms table correspond views linear threshold neuron discussed section brevity call algorithms classification algorithm regression algorithm classification algorithm instance prediction label update regression algorithm instance prediction label update discrete loss classification algorithm receives linear hinge loss regression algorithm receives infinite label sign classification algorithm predicts regression algorithm linear activation loss classification algorithm discrete loss regression algorithm
updates algorithms equivalent update regression algorithm motivated minimization problem setting gradient follow equilibrium equation holds minimum approximately solve equation placing meaning warmuth versions perceptron winnow obtained link functions relative loss bounds lemma relates hinge loss regression algorithm hinge loss arbitrary linear predictor proof equality lemma update rule regression algorithm equality divergence lemma summing equality trials relate total regression algorithm total goal obtain bounds number mistakes classification algorithm natural interpret linear threshold classifier threshold classification algorithm equality trials note sums equality unaffected trials mistake occurs trials equivalent trials mistake occurs theorem theorem trials classification algorithm makes mistake rest section classification algorithm compared performance linear threshold classifier threshold apply theorem perceptron algorithm giving bound average margin linear threshold classifier threshold trial sequence inequality theorem update rule vector replace solve resulting inequality dependence bound number mistakes linear hinge loss average margin note usual mistake bound perceptron algorithm average observe predictions perceptron algorithm affected previous bound holds apply theorem normalized version winnow version winnow weights probability simplex obtained slight modification link function assume choose unlike perceptron algorithm algorithm heavily depends learning rate careful tuning needed show details omitted space limitations normalized version winnow achieves bound relative entropy probability vectors conclusions full paper study case consistent threshold carefully give involved bounds winnow normalized winnow algorithms perceptron algorithm references warmuth relative loss bounds exponential family distributions unpublished manuscript bregman relaxation method finding common point convex sets application solution problems convex programming computational
mathematics physics freund schapire large margin classification perceptron algorithm grove littlestone general convergence results linear discriminant updates helmbold kivinen warmuth worstcase loss bounds linear neurons nips press kivinen warmuth additive versus exponentiated gradient updates linear prediction inform comput kivinen warmuth relative loss bounds multidimensional regression problems nips press kivinen warmuth perceptron algorithm winnow linear logarithmic mistake bounds input variables relevant artificial intelligence littlestone learning irrelevant attributes linear threshold algorithm machine learning littlestone mistake bounds logarithmic learning algorithms thesis california santa cruz littlestone redundant noisy attributes attribute errors linear threshold learning winnow morgan kaufmann average margin positive consistent
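The additive (perceptron) and multiplicative (winnow) updates described in this paper can be sketched as gradient steps on the linear hinge loss. This is a minimal sketch under stated assumptions, not the authors' notation: labels are taken to be in {-1, +1}, the linear hinge loss is written as max(0, -y(a - theta)), the gradient carries a factor of 1/2, and all function names are hypothetical.

```python
import math

def activation(w, x):
    """Linear activation w . x."""
    return sum(wi * xi for wi, xi in zip(w, x))

def predict(w, x, theta=0.0):
    """Thresholded prediction in {-1, +1} (assumed label convention)."""
    return 1 if activation(w, x) >= theta else -1

def linear_hinge_loss(y, a, theta=0.0):
    """Zero when the thresholded prediction agrees with y, else |a - theta|."""
    return max(0.0, -y * (a - theta))

def perceptron_step(w, x, y, eta=1.0, theta=0.0):
    """Additive update: subtract eta times the hinge-loss gradient."""
    g = 0.5 * (predict(w, x, theta) - y)  # 0 if correct, +1 or -1 if wrong
    return [wi - eta * g * xi for wi, xi in zip(w, x)]

def winnow_step(w, x, y, eta=1.0, theta=0.0):
    """Multiplicative update: the same gradient appears in the exponent."""
    g = 0.5 * (predict(w, x, theta) - y)
    return [wi * math.exp(-eta * g * xi) for wi, xi in zip(w, x)]
```

Both steps leave the weights unchanged when the prediction is correct, which matches the conservative behaviour described in the text; only the way the gradient enters (added versus exponentiated) differs.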
8 laterally interconnected selforganizing maps handwritten digit recognition joseph miikkulainen department computer sciences university texas austin austin abstract application laterally interconnected selforganizing maps lissom handwritten digit recognition presented lateral connections learn correlations activity units resulting excitatory connections focus activity local patches inhibitory connections activity forms internal representations easy recognize perceptron network recognition rate subset nist database higher lissom regular selforganizing front higher recognition input directly results form promising starting point building pattern recognition systems lissom front introduction handwritten digit recognition problems neural networks recently large databases training examples nist national institute standards technology special database realworld applications clear practical recognizing codes letters emerged diverse architectures varying learning rules proposed including feedforward networks denker martin pittman selforganizing maps dedicated approaches neocognitron fukushima problem difficult handwriting varies digits easily recognition based small crucial differences example digits overlapping segments differences lost noise handwritten digit recognition process identifying distinct features producing internal representation significant differences making recognition easier paper laterally interconnected selforganizing lissom miikkulainen employed form separable representation lateral inhibitory connections features input retaining differences significant lissom front actual recognition performed standard neural network architecture perceptron experiments showed direct recognition digit simple perceptron network successful time recognizing standard selforganizing front time recognition rate based lissom network results suggest lissom serve effective front realworld handwritten
character recognition systems recognition system architecture system consists networks lissom performs feature analysis decorrelation input single layer perceptrons final recognition figure input digit represented input layer lissom unit fully connected input layer afferent connections units lateral excitatory inhibitory connections figure excitatory connections short range connecting closest neighbors unit inhibitory connections cover perceptron layer consists units digits perceptrons fully connected lissom full activation pattern input perceptron weights learned delta rule lissom afferent lateral weights hebbian learning lissom activity generation weight adaptation afferent lateral weights lissom learned hebbian adaptation image presented input layer initial activity calculated weighted input unit initial response activation input unit afferent weight connecting input unit unit piecewise linear approximation sigmoid activation function activity settled lateral connections activity step depends afferent activation lateral excitation inhibition excitatory inhibitory connection weights unit activation unit previous time step constants control relative strength lateral excitation inhibition activity settled afferent lateral weights modified hebb rule afferent weights normalized length weight miikkulainen output layer lissom input layer lissom unit units excitatory lateral connections units inhibitory lateral connections figure system architecture input layer activated image digit activation propagates afferent connections lissom settles lateral connections stable pattern pattern internal representation input recognized perceptron layer connections lissom perceptrons unit representing strongly activated weak activations units lateral connections unit dark square shown neighborhood excitatory connections shaded view units excitatory region inhibitory lateral connections medium shading center unit excitatory radius inhibitory radius case vector remains lateral weights
normalized weights constant miikkulainen afferent weight input unit unit input learning rate lateral weight excitatory inhibitory unit lateral learning rate perceptron output generation weight adaptation perceptrons output system receive activation pattern lissom input perceptrons trained lissom organized activation perceptron unit scaling constant lissom unit connection weight lissom unit output unit delta rule train perceptrons weight adaptation proportional activity difference output target learning rate perceptron weights lissom unit activity target activation unit correct digit representation training test lissom input table final recognition results average recognition percentage variance splits shown training test sets differences statistically significant experiments subset patterns nist database training testing data patterns normalized make equal effect lissom miikkulainen lissom trained patterns train perceptron layer remaining validation determine stop training perceptrons final recognition performance system measured remaining patterns lissom perceptrons training experiment repeated times random splits input patterns training validation testing sets lissom organized starting initially random weights input dimensionality large case unit activated roughly degree difficult bootstrap selforganizing process miikkulainen standard selforganizing algorithm case performs preliminary feature analysis input forms coarse topological input space starting point lissom algorithm modifies topological learns lateral connections represent clear categorization input patterns initial selforganizing formed epochs training reducing neighborhood radius lateral connections added system epochs afferent lateral weights adapted equations beginning excitation radius inhibition radius excitation radius gradually decreased making activity patterns concentrated causing units selective types input patterns
comparison initial selforganized trained epochs gradually decreasing neighborhood size final afferent weights lissom maps shown figures lissom maps organized complete activation patterns maps collected patterns formed training input perceptron layer separate versions trained epochs lissom patterns perceptron layer trained directly input recognition performance measured counting highly active perceptron unit correct results averaged splits average final system correctly recognized pattern test sets significantly miikkulainen figure final afferent weights patterns represent afferent weights unit projected input layer lower left corner represents afferent weights unit high weight values shown black white pattern weights shows input pattern unit sensitive case local clusters sensitive digit category system achieved perceptron layer table results suggest internal representations generated lissom distinct easier recognize input patterns representations generated discussion architecture motivated hypothesis lateral inhibitory connections lissom force activity patterns distinct recognition performed simplest classification architectures perceptron lissom representations easier recognize patterns support hypothesis additional experiments perceptron output layer replaced backpropagation network hebbian trained patterns perceptrons recognition results practically perceptron backpropagation hebbian output networks indicating internal representations formed lissom important part recognition system comparison learning curves reveals interesting effects figure perceptron trained input patterns initially performs test generalization decreases dramatically training learns training examples noisy patterns good internal representations therefore crucial generalization initially settling process lissom forms patterns significantly easier recognize figure final afferent weights lissom squares identify inhibitory lateral
connections unit thick square note inhibition areas similar functionality areas sensitive similar input activity forming representation input initial patterns formed afferent connections difference insignificant training afferent connections modified final settled patterns gradually learn anticipate decorrelated internal representations lateral connections form conclusion experiments reported paper show lissom forms internal representations input patterns easier categorize inputs patterns suggest lissom form front character recognition systems pattern recognition systems speech main direction future work apply approach larger data sets including full nist database powerful recognition network perceptron increase size obtain richer representation input space acknowledgements research supported part national science foundation grant computer time simulations provided pittsburgh center grants high performance computer time grant university texas austin references johnson digital maps touretzky editor advances neural processing systems mateo morgan kaufmann miikkulainen epochs figure comparison learning curves perceptron network recognize kinds internal representations settled patterns lissom patterns settling patterns final network input recognition accuracy test measured averaged simulations generalization input perceptron decreases rapidly learns training patterns difference settled lissom patterns afferent weights lissom learn account decorrelation performed lateral weights denker gardner graf henderson howard hubbard jackel baird guyon neural network recognizer handwritten code digits touretzky editor advances neural information processing systems mateo morgan kaufmann fukushima character recognition neocognitron advanced neural elsevier science northholland boser denker henderson howard hubbard jackel handwritten digit recognition back propagation network touretzky editor advances neural infor mation processing systems mateo morgan kaufmann martin pittman
recognizing handprinted letters digits touretzky editor advances neural information processing systems mateo morgan kaufmann miikkulainen cooperative selforganization afferent lateral connections cortical maps biological miikkulainen ocular dominance patterned lateral connections selforganizing model primary visual cortex tesauro touretzky leen editors advances neural information processing systems cambridge press miikkulainen topographic receptive fields patterned lateral interaction selforganizing model primary visual cortex press
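The weight-adaptation rules described in this paper (Hebbian learning with normalization for the LISSOM weights, and the delta rule for the perceptron layer) can be sketched as follows. This is a minimal sketch under assumptions: the text states that afferent weights are normalized so the length of the weight vector stays the same and that lateral weights are normalized so their sum stays constant, but the exact equations are not recoverable from it, so unit length and unit sum are assumed here, and the names `alpha`, `beta`, and `gamma` are hypothetical stand-ins for the learning rates and scaling constant.

```python
import math

def hebb_afferent(weights, inputs, activity, alpha):
    """Hebbian step, then rescale so the weight vector has unit length."""
    raw = [w + alpha * activity * x for w, x in zip(weights, inputs)]
    norm = math.sqrt(sum(r * r for r in raw))
    return [r / norm for r in raw]

def hebb_lateral(weights, neighbor_activity, activity, alpha):
    """Hebbian step, then rescale so the weights sum to one."""
    raw = [w + alpha * activity * n for w, n in zip(weights, neighbor_activity)]
    total = sum(raw)
    return [r / total for r in raw]

def delta_rule(weights, lissom_activity, target, gamma, beta):
    """Scaled linear output unit trained with the delta rule.

    The weight change is proportional to the input activity and to the
    difference between the target and the actual output.
    """
    output = gamma * sum(w * a for w, a in zip(weights, lissom_activity))
    updated = [w + beta * a * (target - output)
               for w, a in zip(weights, lissom_activity)]
    return output, updated
```

The two normalizations serve the same purpose, keeping Hebbian growth bounded, but preserve different quantities, which is why they are written as separate functions.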
7 advantage updating applied differential game mance wright laboratory circle force base baird wright laboratory wright laboratory category control navigation planning reinforcement learning advantage updating dynamic programming differential games abstract application reinforcement learning differential game presented reinforcement learning system recently developed algorithm residual gradient form advantage updating game markov decision process continuous time states actions linear dynamics quadratic cost function game consists players plane plane plane reinforcement learning algorithm optimal control modified differential games order find minimax point maximum simulation results compared optimal solution demonstrating simulated reinforcement learning system converges optimal answer performance residual gradient gradient forms advantage updating qlearning compared results show advantage updating converges faster qlearning simulations results show advantage updating converges time step duration qlearning unable converge time step duration small academy suite mance baird klopf advantage updating advantage updating algorithm baird reinforcement learning algorithm types information stored state stored representing estimate total discounted return expected starting state performing optimal actions state action advantage stored representing estimate degree expected total discounted reinforcement increased performing action action considered optimal function represents true state optimal advantage function optimal action advantage relative negative action negative advantage relative action optimal advantage function defined terms optimal function definition advantage includes term ensure small time step duration advantages function advantage function needed learning convergence optimality policy extracted advantage function optimal policy state maximizes notation defines amax converges state advantage function normalized advantage updating shown learn faster qlearning 
watkins continuoustime problems baird advantage updating baird control deterministic system equations equivalent bellman equation iteration bertsekas pair simultaneous equations baird time step duration performing action state results reinforcement transition state optimal advantage functions satisfy equations function bellman residual errors williams baird defined equations degrees equations satisfied advantage updating applied differential game residual gradient algorithms dynamic programming algorithms guaranteed converge optimality lookup tables completely unstable combined function approximation systems baird preparation derive algorithm guaranteed convergence quadratic function approximation system bradtke algorithm specific quadratic systems solution problem derive learning algorithm perform gradient descent squared bellman residuals called residual gradient form algorithm bellman residuals residual gradient algorithm perform gradient descent squared bellman residuals found combine reinforcement learning algorithms function approximation systems tesauro function approximation systems advantage functions function approximation systems parameterized adjustable weights system controlled deterministic incremental learning weight function approximation system changed equation time step simple gradientdescent algorithm equation guaranteed converge correct answer deterministic system sense backpropagation rumelhart hinton williams guaranteed converge system nondeterministic independently generate states action performed state evaluate evaluate ensures weight change unbiased estimator true gradient requires system dyna sutton generate differential game paper deterministic needed mance baird klopf simulation game definition employed differential game comparing qlearning advantage updating comparing algorithms residual gradient forms game players plane games state vector composed state state plane composed position velocity player twodimensional space action vector 
composed action performed action performed plane acceleration player twodimensional space dynamics system linear state linear function current state action reinforcement function quadratic function distance players distance acceleration equation vector equivalent taking product vector seeks minimize reinforcement plane seeks maximize reinforcement plane receives acceleration allowing accelerate easily plane function quadratic function state weight matrices change learning equation advantage function quadratic function state action actions plane dimensions matrices adjustable weights change learning equation general quadratic functions true terms form simplify calculation policy form gradient form gradient respect avoids invert matrix calculating policy bellman residual update equations equations define bellman residuals maximizing total discounted reinforcement optimal control problem equations modify algorithm solve differential games optimal control problems advantage updating applied differential game minimax resulting weight update equation minimax qlearning form weight update equation minimax results residual gradient advantage updating results optimal weight matrices calculated numerically comparison residual gradient form advantage updating learned correct policy weights significant digits extensive training interesting behavior exhibited plane initial conditions plane learned cases turn short term increase distance long term figure time figure simulation dotted line plane solid line learned optimal behavior graph distance time show effects planes turning mance baird klopf comparative results error policy learning system defined squared errors matrix weights optimal policy weights problem advantage updating qlearning metric compare results algorithms learning algorithms compared advantage updating qlearning residual gradient advantage updating residual gradient learning advantage updating form unstable point meaningful results obtained simulation results 
experiment learning rates forms qlearning optimized significant digit simulation single learning rate advantage updating simulations advantage updating performed learning rates algorithm error calculated learning iterations process repeated times random number seeds results averaged experiment performed time step durations gradient form qlearning appeared work weights initialized small numbers initial weights chosen randomly forms algorithms gradient form qlearning small time steps qlearning performed poorly error lower learning rate learning learning rate table learning rates simulation figure shows resulting error learning final error time step duration figure error time step size comparison qlearning advantage rates optimal significant figure forms qlearning optimized advantage updating final error squared errors matrix weights time steps learning final error advantage updating lower forms qlearning case errors increased learning time step size decreased advantage updating applied differential game time step table learning rates simulation learning rates optimal significant figure forms qlearning necessarily optimal advantage updating experiment figure shows comparison algorithms ability converge correct policy figure shows total squared error algorithms policy weights function learning time simulation longer period simulations table figure learning rates simulation identical rates found optimal shorter weights gradient form qlearning bound long experiments learning rate reduced order magnitude residual gradient advantage updating learn correct policy qlearning unable learn policy initial random weights learning ability comparison error conclusion time steps millions figure experimental data shows advantage updating superior algorithms cases time step grows small qlearning unable learn correct policy future research include general networks implementation wire fitting algorithm proposed baird klopf calculate policy continuous choice actions general networks 
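The difference between the direct and residual gradient forms compared in this paper can be sketched on a plain state-value function with a linear function approximator. This is an illustrative simplification, not the advantage updating equations themselves (which the text does not reproduce): it assumes a deterministic transition, a Bellman residual of the form R + gamma * V(x') - V(x), and hypothetical names throughout.

```python
def value(w, features):
    """Linear value estimate V(x) = w . phi(x)."""
    return sum(wi * fi for wi, fi in zip(w, features))

def bellman_residual(w, phi, phi_next, reward, gamma):
    """Residual for one deterministic transition x -> x'."""
    return reward + gamma * value(w, phi_next) - value(w, phi)

def direct_update(w, phi, phi_next, reward, gamma, alpha):
    """Direct form: treats the successor value as a fixed target."""
    delta = bellman_residual(w, phi, phi_next, reward, gamma)
    return [wi + alpha * delta * f for wi, f in zip(w, phi)]

def residual_gradient_update(w, phi, phi_next, reward, gamma, alpha):
    """Residual gradient form: gradient descent on the squared residual,
    so the successor state's features also enter the weight change."""
    delta = bellman_residual(w, phi, phi_next, reward, gamma)
    return [wi - alpha * delta * (gamma * fn - f)
            for wi, f, fn in zip(w, phi, phi_next)]
```

Because the residual gradient step follows the true gradient of the squared Bellman residual, a sufficiently small step cannot increase that residual on the sampled transition, which is the property behind the convergence guarantee the text mentions for deterministic systems.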
acknowledgments research supported task life environmental sciences united states force office scientific research references baird advantage updating force base wright laboratory technical report defense technical information center station baird preparation residual gradient algorithms wright force base wright laboratory technical report baird klopf reinforcement learning highdimensional continuous actions force base wright laboratory technical report defense technical information center station bertsekas dynamic programming deterministic stochastic models englewood cliffs prenticehall bradtke reinforcement learning applied linear quadratic regulation proceedings annual conference neural information processing systems differential games york john wiley sons associative reinforcement learning optimal control unpublished masters thesis massachusetts institute technology cambridge aircraft horizontal plane journal guidance control rumelhart hinton williams learning representations backpropagating errors nature october sutton integrated architectures learning planning based approximating dynamic programming proceedings seventh international conference machine learning tesauro neuralnetwork backgammon program proceedings international joint conference neural networks diego tesauro practical issues temporal difference learning machine learning watkins learning delayed rewards doctoral thesis cambridge university cambridge england
10 learning nonlinear overcomplete representations efficient coding michael lewicki terrence sejnowski howard hughes medical institute computational neurobiology salk institute jolla abstract derive learning algorithm inferring overcomplete basis viewing probabilistic model observed data complete bases approximation underlying statistical density laplacian prior basis coefficients removes redundancy leads representations sparse nonlinear function data viewed generalization technique independent component analysis method blind source separation fewer mixtures sources demonstrate utility overcomplete representations natural speech show compared traditional fourier basis inferred representations potentially greater coding efficiency traditional represent signals fourier wavelet bases disadvantage bases specialized dataset principal component analysis means finding basis adapted dataset basis vectors restricted orthogonal extension called independent component analysis jutten herault comon bell sejnowski learning bases bases complete sense span input space limited terms approximate datasets statistical density representations overcomplete basis vectors input variables provide representation basis vectors specialized larger variety features present entire ensemble data overcomplete representations redundant data point representations redundancy removed prior probability basis specifies probability alternative representations overcomplete bases literature fixed sense adapted structure data recently olshausen field presented algorithm overcomplete basis learned algorithm relied approximation desired probabilistic objective drawbacks including tendency case noise levels learning bases higher degrees paper present improved approximation desired probabilistic objective show leads simple robust algorithm learning optimal overcomplete bases inferring representation data modeled overcomplete linear basis additive noise
matrix columns basis vectors assume gaussian additive noise defines precision noise redundancy overcomplete representation removed defining density basis coefficients specifies probability alternative representations probable representation found maximizing posterior distribution influences data presence noise determines uniqueness representation model data linear function general linear function data basis function complete invertible assuming broad priors noise probable internal state computed simply inverting case overcomplete basis inverted figure shows priors induce representations unlike gaussian prior optimal representation laplacian prior obtained simple linear operation approach optimizing gradient posterior optimization algorithm alternative method finding probable internal state view problem linear program generalized handle positive negative solved interior point linear programming methods chen lewicki sejnowski figure priors induce representations data distribution main axes form overcomplete representation graphs marked show optimal scaled basis vectors data point gaussian laplacian prior assuming noise gaussian equivalent finding exact fitting minimum norm pseudoinverse laplacian prior yields exact minimum norm nonlinear operation essentially selects subset basis vectors represent data chen resulting representation sparse segment speech overcomplete fourier representation basis vectors plot shows rank order distribution coefficients gaussian prior dashed laplacian prior solid significantly positive coefficients required gaussian prior laplacian prior learning learning objective adapt maximize probability data computed internal states general integral evaluated analytically approximated gaussian integral yielding const hessian posterior avoid singularity laplacian prior approximation hessian full rank positive large approximates true laplacian prior learning rule obtained differentiating respect discussion present derivations terms simplifying 
assumptions lead simple form learning rule deriving term specifies change make probability representation probable assume laplacian prior component make representation sparse assume lira order obtain describe function basis complete assume noise simply invert obtain overcomplete simple expression make approximation priors probable solution yield nonzero elements effect selects complete basis represent reduced basis equal elements obtained removing columns elements results obtained case invertible mackay obtain matrix notation obtain expression terms original variables simply invert mapping obtain deriving term specifies change minimize data letting results notation gradient component arises error term deriving term learning rule specifies change weights minimize width posterior distribution increase probability data element defined lewicki sejnowski obtain fact symmetry hessian derive diagonal assume letting result reduced representation obtain stabilizing simplifying learning rule putting terms yields problematic expression matrix multiplying gradient positive definite matrix gradient components preserves direction valid optimization noting large noise hessian dominated vector computation involving inverse hessian basis vectors randomly distributed dimensionality increases basis vectors approximately orthogonal hessian approximately diagonal shown derivatives smooth vanishes large combining remaining terms yields equation note rule matrix inverses vector involves derivative prior case square form rule similar natural gradient independent component analysis learning rule amari difference general case rectangular maximize posterior distribution simply filter matrix standard algorithms examples sources inputs examples bases initialized random normalized vectors coefficients solved interior point linear programming package probable
solution laplacian prior assuming noise algorithm iterations equation stepsize convergence rapid typically requiring iterations cases direction learned vectors matched true generating distribution magnitude estimated precisely possibly approximation viewed source separation problem true separation limited projection sources smaller subspace necessarily loses information figure examples illustrating fitting distributions overcomplete bases equivalent sources mixed channels sources mixed channels data examples generated true basis elements distributed exponential distribution unit identical results obtained drawing laplacian prior positive negative coefficients overcomplete bases model capture true underlying statistical structure data space overcomplete representations speech speech data obtained timit database single speaker speaking preprocessing basis initialized overcomplete fourier basis conjugate gradient routine obtain probable basis coefficients stepsize gradually reduced iterations figure shows learned basis fourier representation power spectrum learned basis vectors multimodal andor broadband learned basis achieves greater coding efficiency bits sample compared bits sample overcomplete fourier basis summary learning overcomplete representations basis approximate statistical density data learned representations encoding denoising properties generic bases unlike case complete representations standard algorithm transformation lewicki sejnowski figure fitting overcomplete representation segments natural speech segment consisted samples sampled frequency plot shows random sample basis vectors scaled full range graph shows power spectral densities data internal representation nonlinear probabilistic lation basis inference problem offers advantages assumptions prior distribution basis coefficients made explicit models compared references amari cichocki yang learning algorithm blind signal separation advances neural information processing systems volume pages 
mateo morgan kaufmann bell sejnowski information maximization approach blind separation blind deconvolution neural computation chen donoho decomposition basis pursuit technical report dept stat stanford univ stanford comon jutten herault blind separation sources problems signal processing jutten herault blind separation sources adaptive algorithm based architecture signal processing mackay maximum likelihood algorithms independent component analysis university cambridge laboratory interior point linear programming code olshausen field emergence properties learning sparse code natural images nature
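The record above states that, with zero noise, a Gaussian prior on the basis coefficients gives the minimum L2-norm exact fit (the pseudoinverse), while a Laplacian prior gives the minimum L1-norm exact fit, which selects a subset of the basis vectors. The following is a minimal sketch of that contrast for a 2-D data point and three basis vectors. The function names, the toy basis `A`, and the data point `x` are illustrative assumptions. The brute-force search over pairs of basis vectors stands in for the interior point linear programming package mentioned in the text; it relies on the fact that a linear program attains its optimum at a basic solution.

```python
import math
from itertools import combinations


def solve_2x2(a, b, c, d, x0, x1):
    """Solve [[a, b], [c, d]] [u, v]^T = [x0, x1]^T by Cramer's rule."""
    det = a * d - b * c
    if abs(det) < 1e-12:
        return None  # the two chosen basis vectors are collinear
    return (x0 * d - b * x1) / det, (a * x1 - c * x0) / det


def gaussian_prior_coeffs(A, x):
    """Minimum L2-norm exact fit, s = A^T (A A^T)^{-1} x.

    A is a 2 x n list of rows and is assumed to have full row rank.
    """
    n = len(A[0])
    g00 = sum(A[0][k] * A[0][k] for k in range(n))
    g01 = sum(A[0][k] * A[1][k] for k in range(n))
    g11 = sum(A[1][k] * A[1][k] for k in range(n))
    y0, y1 = solve_2x2(g00, g01, g01, g11, x[0], x[1])
    return [A[0][k] * y0 + A[1][k] * y1 for k in range(n)]


def laplacian_prior_coeffs(A, x):
    """Minimum L1-norm exact fit, found by trying every pair of columns.

    Brute force replaces the linear programming solver (an assumption made
    to keep the sketch dependency-free); it is only practical for tiny n.
    """
    n = len(A[0])
    best, best_norm = None, math.inf
    for i, j in combinations(range(n), 2):
        sol = solve_2x2(A[0][i], A[0][j], A[1][i], A[1][j], x[0], x[1])
        if sol is None:
            continue
        s = [0.0] * n
        s[i], s[j] = sol
        norm = abs(sol[0]) + abs(sol[1])
        if norm < best_norm:
            best, best_norm = s, norm
    return best


# Toy overcomplete basis: two axes plus a unit-length diagonal vector.
inv_sqrt2 = 1.0 / math.sqrt(2.0)
A = [[1.0, 0.0, inv_sqrt2],
     [0.0, 1.0, inv_sqrt2]]
x = [1.0, 1.0]  # data point lying along the diagonal basis vector

s_gauss = gaussian_prior_coeffs(A, x)
s_lap = laplacian_prior_coeffs(A, x)
print("gaussian prior :", s_gauss)
print("laplacian prior:", s_lap)
```

In this toy case the Gaussian-prior solution spreads weight over all three vectors, while the Laplacian-prior solution uses only the diagonal vector, which is the sparsity effect the text describes.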
10 efficient heuristic ranking hypotheses steve propulsion laboratory california institute technology grove drive pasadena voice content areas applications stochastic optimization model selection algorithms abstract paper considers problem learning ranking alternatives based incomplete information limited number observations describe algorithms ranking application approximately expected loss learning criteria empirical results provided demonstrate effectiveness rank procedures synthetic datasets realworld data spacecraft design optimization problem introduction learning applications cost information high imposing requirement learning algorithms information minimum data speedup learning expense processing training significant decision tree learning cost training examples evaluating potential attributes partitioning computationally expensive evaluating medical treatment policies additional training examples suboptimal human subjects applications training data learning limited data paper statistical decisiontheoretic framework ranking parametric distributions framework provide answers wide range questions algorithms information point adequate information rank alternatives requested confidence remainder paper structured describe ranking problem formally including definitions approximately correct expected loss decision criteria define algorithms establishing criteria hypothesis ranking problem recursive hypothesis selection algorithm based algorithm describe empirical tests demonstrating effectiveness algorithms improved performance standard algorithm ranking literature finally describe related work future extensions algorithms hypothesis ranking problems hypothesis ranking problems extension hypothesis selection problems abstract class learning problems algorithm hypotheses rank expected utility unknown distribution expected utility estimated training data applications system chooses single alternative visits decision systems require
ability investigate options serially parallel beam search iterative broad ranking formulation case evolutionary approaches system future alternative hypotheses basis ranking current hypothesis evaluation problem achieving correct ranking practice actual underlying probability distributions small chance gorithms finite number samples requiring algorithm output correct rank impose probabilistic criteria rankings produced families requirements exist paper examine approximately correct requirement computational learning theory community expected loss requirement frequently decision theory problems expected utility hypothesis estimated observing values finite training examples satisfy requirements algorithm reason potential difference estimated true utilities hypotheses true expected utility hypothesis estimated expected utility hypothesis loss generality proposed ranking hypotheses requirement states user probability correspondingly loss selecting hypothesis hypotheses loss ranking hypothesis ranking algorithm obeys expected loss requirement produce rankings average requested expected loss bound ranking hypotheses expected utilities ranking valid ranking observed loss confidence pairwise comparison hypotheses understood clear ensure desired confidence comparisons required selection complex comparisons required ranking equation defines confidence distribution underlying utilities distributed unknown unequal variances represents cumulative standard normal distribution function size sample sample standard deviation blocked differential distribution likewise computation expected loss ordering pair hypotheses understood estimation expected entire ranking clear equation defines expected loss drawing conclusion normality details describe interpretations estimating lihood ranking satisfies requirements estimating combining pairwise errors estimates interpretations directly algorithmic implementation ranking recursive selection determine ranking view ranking recursive selection 
remaining candidate hypotheses view ranking error desired confidence algorithms loss threshold algorithms distributed selection errors subdivided pairwise comparison errors data sampled estimates pairwise comparison error dictated equation satisfy bounds algorithm degree freedom design recursive ranking algorithms method ranking error ultimately distributed individual pairwise comparisons hypotheses factors influence compute error distribution model error combination determines error allocated individual comparisons combines ranking error candidates targets distribution inequality combine errors conservative approach predicted hypothesis change sampling worst case conclusion depend pairwise comparisons error distributed pairs approach block examples reduce sampling complexity blocking forms estimates difference utility competing hypotheses observed blocking significantly reduce variance data hypotheses independent trivial modify formulas address cases block data details discussion issue policy respect allocation error candidate determines samples distributed contexts consequences early scenarios implemented ranking algorithms divide ranking error favor earlier divide selection error pairwise error based estimates hypothesis parameters order reduce sampling cost error scope paper algorithms combine pairwise error selection error combine selection error ranking error allocate error equally level disadvantage recursive selection hypothesis selected removed pool candidate hypotheses problems rare instances sampling increase confidence selection estimate hypothesis previously selected hypothesis longer dominates case algorithm taking account data sampled assumptions result formulations denote error action selecting hypothesis equation denotes error selection loss situations equation applies base case recursion selection error defined equation compute pairwise confidence implement sampling default number times seed estimates
hypothesis variance allocating error selection pairwise comparisons sampling desired successive algorithm hypotheses means changed change ranking analogous recursive selection algorithm based expected loss defined selection defined constraints description ranking comparison adjacent elements interpretation ranking confidence loss adjacent elements ranking compared case ranking error divided directly pairwise comparison errors leads confidence equation criteria equation criteria ranking comparison adjacent hypotheses establish hypotheses hypotheses ordered observed utility advantage requiring fewer comparisons recursive selection require fewer samples recursive selection reason algorithms correctly bound probability correct selection average loss recursive selection algorithms case algorithms necessarily case algorithms expected loss additive hypothesis relations sharing common instance size blocked differential distribution pairs hypotheses compared relevant approaches standard statistical approaches make strong assumptions form problem variances underlying utility distribution hypotheses assumed equal weiss comparable approach weiss treat hypotheses normal random variables unknown unknown unequal variance make additional hypotheses independent reasonable approach candidate hypotheses independent excessive statistical error large training sizes result empirical performance evaluation turn empirical evaluation hypothesis ranking techniques real world datasets evaluation serves purposes demonstrates techniques perform predicted terms bounding probability incorrect selection expected loss performance techniques compared standard algorithms statistical literature evaluation demonstrates robustness approaches realworld hypothesis ranking problems experimental trial consists solving hypothesis ranking problem technique problem control parameters measure performance algorithms satisfy respective criteria number samples performance statistical algorithms single trial
information behavior trial repeated multiple times results averaged trials approaches investigated extensively statistical ranking selection literature topic confidence interval based algorithms review recent literature efficient heuristic ranking hypotheses table estimated expected total number observations rank spacecraft designs achieved probability correct ranking shown table estimated expected total number observations expected loss incorrect ranking designs parameters samples loss samples loss expected loss criteria directly comparable approaches analyzed separately experimental results synthetic datasets reported eval approach artificially generated data show techniques correctly bound probability incorrect ranking expected loss underlying assumptions valid underlying utility distributions inherently hard rank techniques favorably algorithm weiss wide variety problem configurations test realworld applicability based data drawn actual nasa spacecraft design optimization application data strong test applicability techniques statistical techniques make form normality assumption data application highly tables show results ranking designs based expected loss algorithms problem utility function depth cases assigned utility shown table algorithms significantly outperformed algorithm expected hypotheses correlated impact orientations densities table shows expected loss algorithm effectively bounded actual loss algorithm inconsistent discussion conclusions number areas related work considerable analysis hypothesis selection problems selection problems formalized bayesian framework require initial sample rigorous encoding prior knowledge howard details bayesian framework analyzing learning cost selection problems hypothesis selection framework ranking allocation pairwise errors performed reinforcement learning work feedback viewed hypothesis selection problem summary paper hypothesis ranking problem extension hypothesis selection problem defined application decision 
criteria approximately correct expected loss problem defined families algorithms recursive selection solution hypothesis ranking problems finally demonstrated effectiveness algorithms synthetic realworld datasets improved performance existing statistical approaches references multiple decision procedure ranking means normal populations variances annals math statistics efficient allocation resources hypothesis evaluation statistical approach ieee trans pattern analysis machine intelligence july efficiently ranking hypotheses machine learning june online httpwww html goldberg genetic algorithms search optimization machine learning sequential statistical analysis american sciences press solution problem speedup learning proc jose july decisiontheoretic approach adaptive problem solving tech dept comp univ illinois improving learning performance rational resource allocation proc seattle august statistical approach solving problem proc jose july modern statistical selection sciences press introduction mathematical statistics london howard decision analysis perspectives inference decision experimentation proceedings ieee kaelbling learning embedded systems press cambridge learning search control knowledge explanationbased approach kluwer academic moore efficient algorithms minimizing cross error proc july russell decision theoretic induction large databases proc june rivest sloan model inductive inference proc conference theoretical aspects reasoning knowledge russell thing studies limited press theory unsupervised speedup learning proc weiss class sequential procedures problems normal means unknown unequal variances design experiments ranking selection marcel dekker valiant theory learnable communications
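The record above describes ranking by comparison of adjacent hypotheses: hypotheses are ordered by observed utility, the allowed ranking error is divided among the adjacent pairwise comparisons, and each pairwise confidence is computed from the blocked differences of utilities under a normality assumption. The following is a minimal sketch of that idea. The paper's confidence equation did not survive extraction, so the formula used here, Phi(mean_diff * sqrt(n) / s_diff), is an assumption consistent with the description, as is the equal split of the error bound `delta` across pairs. All function names and data are hypothetical.

```python
import math
from statistics import mean, stdev


def normal_cdf(z):
    """Cumulative standard normal distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def pairwise_confidence(better, worse):
    """Estimated probability that `better` truly has the higher utility.

    Uses blocked (paired) differences; the exact form is an assumption,
    not the paper's stated equation. Requires at least two paired samples.
    """
    diffs = [a - b for a, b in zip(better, worse)]
    m, s = mean(diffs), stdev(diffs)
    if s == 0.0:
        return 1.0 if m > 0 else 0.0
    return normal_cdf(m * math.sqrt(len(diffs)) / s)


def rank_adjacent(samples, delta):
    """Order hypotheses by sample mean and check adjacent comparisons.

    The ranking error delta is split equally over the k-1 adjacent pairs
    (a conservative, union-bound style allocation). Returns the ordering
    and whether every adjacent pair meets its confidence target.
    """
    order = sorted(samples, key=lambda name: mean(samples[name]), reverse=True)
    per_pair = delta / (len(order) - 1)
    confident = all(
        pairwise_confidence(samples[a], samples[b]) >= 1.0 - per_pair
        for a, b in zip(order, order[1:])
    )
    return order, confident


# Hypothetical utilities observed on the same six training examples (blocked).
samples = {
    "A": [10, 11, 12, 11, 10, 12],
    "B": [7, 8, 9, 8, 6, 9],
    "C": [3, 4, 5, 5, 3, 4],
}
order, confident = rank_adjacent(samples, delta=0.05)
print(order, confident)
```

When `confident` is false, a sequential procedure of the kind the text describes would draw more samples before committing to the ranking; that sampling loop is omitted here.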
4 locomotion lower vertebrate studies cellular basis oscillator coupling james department biology university abstract test neurons lamprey spinal cord sufficient account connection neural network simulation identical cells connected experimentally established patterns demonstrated network oscillates stable manner phase relation neurons observed lamprey model explore coupling identical oscillators concluded neurons dual role rhythm generators oscillators produce phase relations observed segmental oscillators swimming introduction approach analyzing neurobiological systems simpler amenable techniques investigate cellular synaptic network levels organization involved generation behavior proach yielded significant progress analysis rhythm pattern generators invertebrate stomatogastric ganglion lobster selverston carrying similar types studies rhythm generation vertebrate preparation lamprey spinal cord offers technical advantages invertebrate nervous systems understanding identified lamprey interneurons participate coupling oscillators neural network models swimming neuronal correlate swimming induced isolated lamprey spinal cord exposure considered principal excitatory neurotransmitter intact swimming lamprey swimming characterized periodic bursts motoneuron action potentials lateral edge input current figure lamprey spinal interneurons types interneurons intracellular inhibitory excitatory postsynaptic poten effects selective firing frequency spike intervals current injection locomotion network ventral roots bursts alternate sides spinal cord propagate direction forward swimming cohen williams cellular mechanisms generating basic swimming pattern spinal cord demonstrated vertebrates peak depolarization peak swim cycle figure connectivity activity patterns synaptic connectivity interneurons motoneurons bottom histograms activity recorded swimming timing activity neurons onset ipsilateral ventral root burst swimming rhythm generator thought consist chain coupled 
distributed length spinal cord isolated spinal cord pieces small segments length level exhibit alternating ventral root bursting application intrinsic swimming frequency pieces spinal cord twofold consistent relationship intrinsic frequency level piece originated observed cohen coupling oscillators provide capacity cope intrinsic frequency differences feature coupling constancy phase wide range swimming cycle periods delay ventral root burst onsets segments constant fraction cycle period williams cycle period swimming lamprey vary range axonal conduction time factor delay segments spinal interneurons recent years classes spinal neurons characterized variety neurobiological techniques intracellular recording membrane potential classes neurons active swimming include lateral interneurons cells axons projecting excitatory interneurons large neurons projecting inhibitory axon rons inhibitory cells small interneurons projecting axons axons cell types project segments interact neurons multiple segments neurons similar resting firing properties indistinguishable resting potentials thresholds action potential amplitudes durations potentials main differences parameters input resistance membrane time constant fire action potentials duration long current pulses showing adaptation frequency successive action potentials plots spike frequency input current cell types generally monotonic tendency saturate higher levels input current synaptic cells established simultaneous intracellular recording postsynaptic neurons results activity patterns swimming cells exhibit oscillating membrane potentials peaks tend occur ventral root burst occur cycle cohen oscillations large part phases synaptic input excitatory phase inhibitory phase russell excitatory phase motoneurons inhibitory phase interneurons interact motoneurons interneurons possibility exists interneurons provide synaptic drive neurons network motoneurons addition locomotion network ally pattern synaptic connectivity circuit basic
alternating network reciprocal interneurons opposite sides spinal cord reciprocal inhibition oscillatory network form provided feedforward inhibition ipsilateral interneurons inhibition early peak observed interneurons swimming neural network model ability network generate basic oscillatory pattern swimming tested connectionist neural network simulation cells neural network identical inputoutput curves differed excitatory levels synaptic connectivity scheme excitation made larger network oscillate oscillations began fairly continued thousands cycles phase relations units similar lamprey cells opposite sides spinal cord cells side cord significantly model lamprey phase advanced inhibition figure activity neural network model lamprey circuit neural network model lamprey swimming oscillator explore coupling oscillators achieved identical oscillator networks coupled pairs cells network connected pairs cells network pairs connections tested interneurons interact neurons multiple segments coupling evaluated criteria based observations lamprey swimming stability phase difference oscillators rate achieving steadystate ability coupling tolerate intrinsic frequency differences oscillators constancy phase wide range oscillator frequencies time cycle period time added figure coupling identical oscillators connectivity steadystate coupling single cycle constancy phase range oscillator periods adding oscillator phase simulating backward swimming locomotion network pairs interneurons oscillators capable producing stable phase locking coupling connections operated wider range synaptic weights steadystate phase difference oscillators rate reaching dependent synaptic weight coupling connections direction phase difference postsynaptic oscillator leading depended type postsynaptic cell sign coupling postsynaptic cell speeds network excitation coupling connection produced lead postsynaptic network inhibition produced opposite pattern held slow network coupling scheme satisfied criteria
pling shown case bidirectional symmetric coupling oscillators gave network ability intrinsic frequency differences oscillators capacity provide phase oscillator connected greater weight direction coupling reached steadystate single cycle phase difference maintained range cycle periods backward swimming shown recently rhythmic presynaptic inhibition axons lamprey spinal cord type modulation synaptic strength account shifts phase coupling lamprey occurs animal switches backward swimming mechanism backward swimming connection axons segments body length neural network model descending interneurons backward swimming phase lead postsynaptic oscillators presynaptic inhibition connections local segments forward swimming removal presynaptic inhibition backward swimming conclusions modeling demonstrates identified interneurons lamprey spinal cord contribute synaptic input motoneurons swimming shaping final motor output function components rhythm generating network finally virtue connections additional role providing coupling signals experimental work required determine connections lamprey spinal cord functions references presynaptic modulation axons spinal motor interneurons neurosci identification interneurons contralateral caudal axons lamprey spinal cord synaptic interactions morphology neuro physiol electrophysiological properties lamprey spinal neurons neurosci abstr neural network simulations coupled oscillators lamprey spinal cord biol cybern press cohen activities identified interneurons muscle fibers swimming lamprey effects dorsal cell stimulation neurophysiol identification interneurons contributing generation locomotion lamprey structure function neurophysiol cohen intersegmental system lamprey exper theoretical studies stein stuart neurobiology vertebrate locomotion cohen neuronal correlate locomotion fish swimming induced vitro preparation lamprey spinal cord brain control locomotion fish brooks handbook physiology nervous system motor control maryland press 
patterns synaptic interneurons swimming lamprey revealed comp neurol synaptic interactions identified nerve cells spinal cord lamprey comp neurol russell control swimming lamprey spinal cord vitro physiol selverston miller cooperative mechanisms production rhythmic movements biol williams locomotion lamprey spinal cord vitro compared swimming intact spinal animal physiol
0 constrained differential optimization john platt alan california institute technology pasadena abstract optimization models neural networks constraints restrict space outputs subspace satisfies external criteria energy methods yield forces state neural network penalty method quadratic energy constraints added existing optimization energy popular recently guaranteed satisfy constraint conditions forces neural model multiple constraints paper present basic differential multiplier method bdmm satisfies constraints create forces gradually apply constraints time neurons estimate lagrange multipliers basic differential multiplier method differential version method multipliers numerical analysis prove differential equations locally converge constrained minimum examples applications differential method multipliers include enforcing permutation codewords analog decoding problem enforcing valid tours traveling salesman problem introduction optimization ubiquitous field neural networks learning algorithms backpropagation optimize minimizing difference expected solutions observed solutions neural algorithms differential equations minimize energy solve computational problem associative memory differential solution salesman problem analog decoding linear programming lyapunov methods show models neural behavior find minima functions solutions constrained optimization problem restricted subset solutions optimization problem mutual inhibition circuit requires neuron rest salesman problem salesman minimize travel distance subject constraint visit city curve fitting problem elastic splines smooth data finally digital decisions made analog data answer constrained bits constrained optimization problem stated subject state neural network position vector highdimensional space scalar energy height landscape function position scalar equation describing subspace state space constrained optimization state attracted subspace subspace reaches locally smallest section paper describe classical 
methods constrained optimization penalty method multipliers section introduces basic differential multiplier method bdmm constrained optimization calculates good local minimum constrained optimization problem convex local minimum global minimum general finding global minimum nonconvex problems fairly difficult section show lyapunov function bdmm drawing analogy physics american institute physics section augmented idea optimization theory enhances convergence properties bdmm section apply differential algorithm neural problems discuss bdmm choice parameters parameter sensitivity persistent problem neural networks classical methods constrained optimization section discusses methods constrained optimization penalty method lagrange multipliers penalty method previously differential optimization basic differential multiplier method developed paper applies lagrange multipliers differential optimization penalty method penalty method analogous adding band neural state subspace penalty method adds quadratic energy term penalizes violations constraints constrained minimization problem converted unconstrained minimization problem minimize figure penalty method makes state space penalty method extended fulfill multiple constraints band constrained optimization problem minimize subject converted unconstrained optimization problem minimize penalty method convenient features easy globally convergent correct answer constraints case spline curve fitting input data compromise fitting data making smooth spline penalty method number disadvantages finite constraint strengths doesnt fulfill constraints multiple band constraints building machine bands machine hold perfectly constraints added constraint strengths harder size network dimensionality large addition dilemma setting constraint strengths strengths small system finds deep local minimum fulfill constraints strengths large system quickly constraints stuck poor local minimum lagrange multipliers lagrange multiplier methods
convert constrained optimization problems unconstrained problems solution equation critical point energy called lagrange multiplier constraint direct consequence equation gradient gradient constrained extrema figure constant proportionality design bdmm contours figure constrained minimum simple shows lagrange multipliers provide degrees freedom solve constrained optimization problems problem finding point line closest origin lagrange multipliers derivative respect variables extra variable equations unknowns addition equation precisely constraint equation basic differential multiplier method constrained optimization section presents neural algorithm constrained optimization consisting equations estimate lagrange multipliers neural algorithm variation method multipliers presented gradient descent work lagrange multipliers simplest differential optimization algorithm gradient descent state variables network opposite gradient applying gradient descent energy equation yields note auxiliary differential equation additional neuron apply constraint recall system constrained energies involving lagrange multipliers critical points tend saddle points energy equation frozen energy decreased sending gradient descent work lagrange multipliers critical point energy equation attractor stationary point local minimum order gradient descent converge algorithm basic differential multiplier method present alternative differential gradient descent estimates lagrange multipliers constrained minima attractors differential equations differential equations solve equation similar equation equation constrained extrema stationary points equation notice sign inversion equation compared equation equation performing gradient ascent sign flip makes bdmm stable shown section equation corresponds neural network connections neuron neurons extensions algorithm extension equation algorithm constrained minimization multiple straints adding extra neuron equality constraint summing constraint forces 
creates energy yields differential equations extension constrained minimization inequality constraints traditional optimization theory extra slack variables convert inequality constraints equality constraints constraint form expressed positive constrained positive slack variable treated component equation inequality constraint requires extra neurons slack variable lagrange multiplier alternatively inequality constraint represented equality constraint optimization constrained algorithm works system differential equations bdmm gradually constraints notice function replaced changing location constrained minimum increased state begins undergo damped oscillation constraint subspace increased frequency oscillations increase time convergence increases subspace path algorithm force constraint initial state figure state attracted constraint subspace damped oscillations equation explained combining differential equations secondorder differential equation equation equation damped mass system inertia term damping matrix internal force derivative internal energy system damped state remains bounded state falls constrained minima physics construct total energy system kinetic potential energies total energy decreasing time state remains bounded system extra energy settle state constrained original problem equation time derivative total energy equation damping matrix positive definite system converges fulfill constraints bdmm converges special case constrained optimization quadratic programming quadratic programming problem quadratic function piecewise linear continuous function positive circumstances damping matrix positive definite system converges multiple constraints case multiple constraints total energy equation time derivative bdmm solves quadratic programming problem solution exists pose problem constraints case conflicting constraints bdmm make constraint small lagrange multipliers arbitrarily limit large absolute invariance theorem prove bdmm eventually open subset subset 
closure system differential equations equilibrium damping matrix positive bounds time system modified differential method multipliers cation robust operation problem bdmm region positive multipliers method yield modified locally convergent compatible adds force equation energy differential equations force penalty change position stationary points differential equations penalty force damping matrix modified penalty force theorem states exists damping matrix equation positive definite minima continuity damping matrix positive definite region surrounding minimum system starts region remains bounded convergence theorem section applicable converge constrained minimum minimum penalty strength strength needed penalty method examples section examples illustrate bdmm bdmm find good solution planar traveling salesman problem enforcing mutual inhibition digital results task analog decoding planar traveling salesman traveling salesman problem cities lying plane find shortest closed path city finding shortest path npcomplete finding optimal path easier finding globally optimal path exist heuristic algorithms approximately solving traveling salesman solution presented section moderately effective illustrates independence bdmm parameters durbin elastic snake solve snake discretized curve lies plane elements snake points plane snake locally connected neural network neural outputs positions plane snake minimizes length subject constraint snake cities city coordinates closest snake point city constraint strength minimization equation quadratic constraints equation piecewise linear continuous potential energy equation damping positive definite system converges state constraints fulfilled practice snake starts circle groups cities snake snake close groups cities specific ordering cities locally minimize length figure system differential equations solve equations piecewise linear differential equations solved implicit method decomposition solve linear system points snake sorted bins 
divide plane computation finding nearest point simplified figure snake eventually cities constrained minimization equations reasonable method approximately solving cities distributed square snake points numerical step size time units constraint strength tour lengths longer yielded simulated annealing empirically cities time needed compute final city ordering scales compared method scales roughly constraint strength city problem city problem changing constraint strength affects performance snake cities constraint strength parameter adjustment issue number cities increases unlike penalty method analog decoding analog decoding analog signals noisy channel reconstruct codewords analog decoding performed neurally code space permutation matrices space binary matrices perform decoding permutation matrices nearest permutation matrix signal matrix found words find nearest matrix signal matrix subject constraint matrix onoff binary elements column signal matrix result minimize subject constraints constraint equation forces crisp digital decisions constraints mutual inhibition rows columns matrix optimization equation quadratic linear addition constraint equation nonlinear bdmm results oscillations order converge constrained minimum system adequate damping oscillations choice insensitive size system wide range oscillations figure decoder finds nearest permutation matrix test signal matrix permutation matrix noise signaltonoise ratio supplied network figure system turned correct neurons incorrect neurons constraints start applied eventually system reaches permutation matrix differential equations reset signal matrix applied network neural state move solution conclusions field neural networks differential optimization algorithms find local solutions nonconvex problems basic differential multiplier method modification standard constrained optimization algorithm improves capability neural networks perform constrained optimization bdmm offer advantages penalty method differ 
equations penalty method large quadratic terms needed order strongly enforce constraints energy penalty method steep finding minima types energy surfaces numerically difficult addition steepness penalty terms sensitive dimensionality space differential multiplier methods promising techniques stiffness differential multiplier methods separate speed constraints constraints penalty method strengths constraint constraint fulfilled energy undesirable local minima differential multiplier methods choose quickly fulfill constraints bdmm constraints compatible penalty method addition penalty terms change stationary points algorithm helps oscillations improve convergence bdmm form firstorder differential equations directly implemented hardware performing constrained optimization speed analog vlsi promising technique solving difficult perception problems exist lyapunov functions bdmm bdmm converges ally quadratic programming provably convergent local region constrained minima optimization algorithms newtons method similar convergence properties global convergence properties bdmm investigation summary differential method multipliers enforcing constraints neural networks enforcing syntax solutions encouraging desirable properties solutions making crisp decisions acknowledgments paper supported bell laboratories fellowship references arrow studies linear nonlinear programming stanford university press stanford bertsekas practical guide splines springerverlag cohen grossberg ieee trans systems cybernetics durbin willshaw nature physiology nerve cells johns hopkins press baltimore theory appl optimization theory wiley sons hopfield pnas hopfield tank biological cybernetics kirkpatrick vecchi science stability dynamical systems siam philadelphia mead analog vlsi neural systems addisonwesley reading platt hopfield conf proc neural networks computing denker american institute physics optimization academic press press teukolsky numerical recipes bridge university press cambridge 
rumelhart hinton williams parallel distributed processing rumelhart press cambridge tank hopfield ieee trans
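The differential multiplier dynamics described in the paper above can be illustrated with a small numerical sketch. It assumes the usual form of the basic differential multiplier method: gradient descent on the state, dx/dt = -grad f(x) - lambda * grad g(x), combined with ascent on the multiplier, dlambda/dt = g(x). The function name `bdmm`, the forward Euler integrator, the step size, and the toy problem are all assumptions made for illustration and are not taken from the paper. The optional parameter `c` adds the penalty force c * g(x) * grad g(x), which supplies extra damping without moving the stationary points.

```python
def bdmm(grad_f, g, grad_g, x0, lam0=0.0, c=0.0, dt=0.01, steps=5000):
    """Forward Euler integration of the (assumed) BDMM equations.

    dx/dt      = -grad_f(x) - (lam + c * g(x)) * grad_g(x)
    dlam/dt    = +g(x)
    """
    x, lam = list(x0), lam0
    for _ in range(steps):
        gx = g(x)                       # current constraint violation
        gf, gg = grad_f(x), grad_g(x)   # gradients at the current state
        # Descend on the state; the multiplier (plus optional penalty) weights the constraint force.
        x = [xi - dt * (dfi + (lam + c * gx) * dgi)
             for xi, dfi, dgi in zip(x, gf, gg)]
        # Ascend on the multiplier: it grows while the constraint is violated.
        lam += dt * gx
    return x, lam


# Toy problem (hypothetical): minimize x^2 + y^2 subject to x + y - 1 = 0.
x_star, lam_star = bdmm(
    grad_f=lambda x: [2.0 * x[0], 2.0 * x[1]],
    g=lambda x: x[0] + x[1] - 1.0,
    grad_g=lambda x: [1.0, 1.0],
    x0=[2.0, -1.0],
)
print(x_star, lam_star)
```

For this quadratic objective with a linear constraint the state spirals into the constraint subspace with damped oscillations and settles at the constrained minimum, which matches the quadratic programming case discussed in the text. Setting `c` above zero increases the damping.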
8 discovering structure continuous variables bayesian networks hofmann volker tresp siemens central research germany abstract study bayesian networks continuous variables linear conditional density estimators demonstrate structures extracted data selforganized present sampling techniques belief update based markov blanket conditional density models introduction strongest types information learned unknown process discovery dependencies important indepen superior medical goal find disease exclude factors irrelevant complete independence variables domain rare reality joint probability density variables factored conditional independence common result true apparent case independent condition precisely notion effect resulting independence variables represented explicitly bayesian networks pearl argued causal thinking leads clear knowledge representation form conditional probabilities efficient local belief propagating rules bayesian networks form complete probabilistic model sense repre joint probability distribution variables involved powerful volker discovering structure continuous variables bayesian networks features bayesian networks variable predicted variables bayesian networks make explicit statements certainty estimate state variable aspects important medical fault diagnosis systems recently learning structure parameters bayesian networks addressed allowing discovery structure variables buntine heckerman research bayesian networks focused systems discrete variables linear gaussian models combinations linear continuous variables pose problem bayesian networks words pearl representing continuous quantity estimated magnitude range uncertainty quickly produce computational continuous variables impose computational paper present approaches applying concept bayesian networks arbitrary nonlinear relations continuous variables fast learners parzen windows based conditional density estimators modeling local depen demonstrate parsimonious bayesian network extracted data 
unsupervised selforganized learning belief update local markov blanket conditional density models combination gibbs sampling efficient sampling conditional density unknown variable bayesian networks introduction bayesian networks closely heckerman joint probability density variables decompose chain rule probability variable parents denoted variables renders independent note include elements indi conditional independence variables included variables dependencies variables depicted directed acyclic graphs directed arcs members parents child bayesian networks natural description dependencies variables depict causal relationships tween variables bayesian networks commonly representation knowledge domain experts experts define structure bayesian network local conditional probabilities recently great notation treat continuous case handling mixtures continuous discrete variables impose additional smallest note defined respect ordering variables directed loops hofmann tresp emphasis learning structure parameters bayesian networks heckerman previous work concentrated models discrete variables linear models continuous variables probability distribution continuous discrete variables multidimensional gaussian paper ideas context continuous variables nonlinear dependencies learning structure parameters nonlinear continuous bayesian networks structures developed neural network community model conditional density distribution continuous variables usual independent gaussian noise model feedforward neural work conditional density model notation normal density centered variance complex conditional densities modeled mixtures experts parzen windows based density estimators periments section generic conditional probability model joint probability model equations learning bayesian networks decomposed problems learning structure arcs network learning conditional density models structure structure network data complete data train conditional density models independently loglikelihood 
model decomposes conveniently indi likelihoods models conditional probabilities competing network structures basically faced wellknown biasvariance dilemma choose network arcs introduce large parameter variance remove arcs introduce bias problem complex freedom reverse arcs experiments evaluate network structures based model likelihood leaveoneout crossvalidation defines scoring function network structures explicitly score network structure score prior network structures leaveoneout crossvalidation likelihood referred training samples probability density sample structure samples terms computed local densities equation networks computationally impossible calculate score network structures search global optimal network structure heckerman follow fully bayesian approach priors defined parameters structure fully bayesian approach elegant solved closed form case general nonlinear models data incomplete discovering structure continuous variables bayesian networks nphard section describe heuristic search closely related search strategies commonly discrete bayesian networks heckerman prior models bayesian framework provide means exploiting prior knowledge typically introducing bias simple structures biasing models simple structures model selection criteria based crossvalidation case variance score experiments added penalty loglikelihood number arcs parameter determines weight penalty specific knowledge form structure defined domain expert alternatively penalize deviation structure heckerman prior knowledge introduced form artificial training data treated identical real data loosely correspond concept conjugate prior experiment experiment parzen windows based conditional density estimators model conditional densities equation training gaussians centered location sample joint inputoutput space gaussians denominator centered location sample input parent space conditional model optimized leaveoneout cross validation unsupervised structure optimization procedure starts complete 
bayesian model equation model pair variables direction additions produce directed loops evaluate change score evaluating legal single modifications accept change improves score procedure stops change decreases score greedy strategy stuck local principle avoided result worse performance accepted nonzero probability annealing strategies heckerman calculating score step requires local computation removal addition corresponds simple removal addition dimension gaussians local density model maintained global density estimators maintain equivalence means network independence model score test order nodes determining direction initial arcs random experiments treated small score allowing small decreases score hofmann tresp number iterations number inputs figure left evolution dashed loglikelihood test continuous structure optimization curves averages runs partitions training test sets likelihoods normalized respect number penalty dotted line shows parzen joint density model commonly statistics assuming independencies width gaussians conditional density models loglikelihood local conditional parzen model variable test continuous dashed function number parents inputs rate percent land percent business located charles oxide concentration average number rooms percent built weighted distance center access radial rate ratio percent black percent population median figure final structure full data operation widths gaussians affected local models optimized reversal simply execution removal addition experiment boston housing data ples sample consists housing price variables influence housing price boston neighborhood figure figure left shows experiment samples test monitor process algorithm sees test data increase likelihood model test data unbiased estimator model improved extraction structure data large increase loglikelihood understood studying figure picked single variable node formed density model predict vari remaining variables removed input variables order significance removal 
variable optimized note increases input variables left fact discovering structure continuous variables bayesian networks irrelevant variables variables represented remaining input variables removed loglikelihood fully connected initial model figure left runs test scores final structures standard deviation comparing final structures terms undirected arcs difference average structure runs depicted figure comparison initial complete structure arcs arcs left arcs changed direction advantages bayesian networks easily interpreted goal original boston housing data experiment examine oxide concentration influences housing price structure extracted algorithm dependent vari ables common child variables independent interesting question quantities predicting housing price variables render housing price independent variables parents children parents variable variables bayesian networks directions arcs induce independencies direction arcs uniquely determined expected arcs reflect direction missing data markov blanket conditional density model bayesian networks typically applications variables miss partial information states subset variables goal update beliefs probabilities unknown variables powerful local update rules networks discrete variables undirected loops belief update networks loops general nphard generally applicable update rule unknown variables networks discrete continuous variables gibbs sampling gibbs sampling roughly variables state states unknown variables choose initial states pick variable update probability distribution repeatedly unknown variables discard samples samples generated drawn probability distribution unknown variables variables samples easy calculate expected unknown variables estimate variances covariances statistical measures mutual information variables direction arcs unique difference undirected arcs compare structures number arcs present structures respect number arcs fully connected network hofmann tresp gibbs sampling requires sampling 
univariate probability distribution equation straightforward model conditional sity convenient form sampling techniques importance sampling case typically produce rejected samples inefficient alternative sampling based markov blanket conditional density models markov blanket smallest variables bayesian network markov blanket variable consists parents parents idea form conditional density model variable network computing equation sampling model simple conditional parzen models conditional density mixture gaussians sample rejection markov blanket conditional density models interesting interested predicting variable neural network applications assuming model good model conditional density train ordinary neural network predict variable interest addition train model input variable predicting remaining variables addition tained model complete data case handle missing inputs backward inference gibbs sampling conclusions demonstrated bayesian models local conditional density estimators form promising nonlinear dependency models continuous variables conditional density models trained locally training data complete paper focused selforganized extraction structure bayesian networks serve framework modular construction large systems smaller conditional density models bayesian framework consistent update rules probabilities communication modules finally input pruning variable selection neural networks note pruning strategy figure considered form variable selection removing variables statistically independent output variable removing variables represented remaining variables obtain compact models input values missing indirect influence pruned variables output recovered sampling mechanism references buntine operations learning models journal artificial intelligence research heckerman tutorial learning bayesian networks microsoft research pearl probabilistic reasoning intelligent systems mateo morgan kaufmann open issues consistency conditional models
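The Parzen-window conditional density model and the leave-one-out scoring used in the experiment above can be sketched as follows. The sketch assumes isotropic Gaussian kernels of a single width `sigma` centered on every training sample: in the joint child/parent space for the numerator and in the parent space for the denominator. The function names, the shared kernel width, and the synthetic data are illustrative assumptions, not the authors' implementation. Sums are evaluated in the log domain so that samples far from all others do not cause division by zero.

```python
import numpy as np


def _logsumexp(a, axis):
    """Numerically stable log(sum(exp(a))) along one axis."""
    m = np.max(a, axis=axis, keepdims=True)
    return np.squeeze(m, axis=axis) + np.log(np.sum(np.exp(a - m), axis=axis))


def loo_conditional_loglik(child, parents, sigma):
    """Mean leave-one-out log-likelihood of p(child | parents) under a Parzen model."""
    child = np.asarray(child, dtype=float)
    parents = np.asarray(parents, dtype=float)
    if child.ndim == 1:
        child = child[:, None]
    if parents.ndim == 1:
        parents = parents[:, None]
    d = child.shape[1]
    # Pairwise squared distances in child space and in parent space.
    sq_c = ((child[:, None, :] - child[None, :, :]) ** 2).sum(-1)
    sq_p = ((parents[:, None, :] - parents[None, :, :]) ** 2).sum(-1)
    # Log kernels; the parent-space normalizer cancels in the ratio and is omitted.
    log_k_c = -0.5 * sq_c / sigma**2 - 0.5 * d * np.log(2.0 * np.pi * sigma**2)
    log_k_p = -0.5 * sq_p / sigma**2
    # Leave-one-out: a sample may not contribute to its own density estimate.
    np.fill_diagonal(log_k_p, -np.inf)
    log_num = _logsumexp(log_k_c + log_k_p, axis=1)  # joint-space Gaussians
    log_den = _logsumexp(log_k_p, axis=1)            # parent-space Gaussians
    return float(np.mean(log_num - log_den))


# Synthetic check (hypothetical data): y depends on x but not on z.
rng = np.random.default_rng(0)
x = rng.normal(size=200)
z = rng.normal(size=200)
y = 2.0 * x + 0.1 * rng.normal(size=200)
score_relevant = loo_conditional_loglik(y, x, sigma=0.3)
score_irrelevant = loo_conditional_loglik(y, z, sigma=0.3)
print(score_relevant, score_irrelevant)
```

A structure search of the kind described in the text would sum such local scores over all variables, subtract a penalty proportional to the number of arcs, and accept a single arc addition, removal, or reversal when the total improves. Because each score depends only on a variable and its parents, a modification requires recomputing only the affected local models.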
2 jain waibel incremental parsing modular recurrent connectionist networks jain alex waibel school computer science carnegie mellon university pittsburgh abstract present modular recurrent connectionist network architec ture learns robustly perform incremental parsing complex sentences sequential input word time networks learn semantic role assignment noun phrase clause recognition sentences passive center embedded clauses networks make syntactic semantic predictions point time previous predictions revised expectations violated arrival networks induce grammar rules dynamically transforming input sequence words networks generalize display tolerance input corrupted ways common spoken language introduction previously reported experiments connectionist models small task network formalism extends backpropagation sequential symbolic domains parsing jain showed nectionist networks learn complex dynamic behavior needed parsing task included passive sentences require dynamic incorporation previously context information partially built interpretations trained parsing network exhibited predictive behavior modify confirm incremental parsing modular recurrent connectionist networks units clause phrase word level clause structure units phrase word units figure highlevel parsing architecture hypotheses sentences sequentially processed generalize tolerate input paper describe work extending parsing architecture complex sentences paper organized briefly outline network formalism general architecture parsing task defined procedure constructing training parser presented dynamic behavior parser illustrated performance characterized network architecture developed extension backpropagation networks specifically designed perform tasks sequential domains requiring symbol manipulation jain substantially connectionist approaches sequential problems elman jordan waibel major features formalism units retain partial activation updates respond repetitive weak stimuli singular sharp stimuli 
units responsive static activation values units dynamic symbol buffers constructed groups units connections gated units formalism supports recurrent networks networks learn complex timevarying behavior gradient descent procedure error backpropagation figure shows highlevel diagram general parsing architecture organized hierarchical levels word phrase clause structure clause roles inter presentation work appears jain waibel jain waibel clause description proceed bottom word presented network stimulating word unit short time produces pattern activation feature units represents meaning word connections word units feature units encode semantic syntactic information words network fixed phrase level sequence word representations word level build contiguous phrases connections word level phrase level modulated gating units learn required conditional assignment behavior clause structure level maps phrases constituent clauses input sentence clause roles level describes roles relationships phrases clause sentence final level inter clause represents clauses section defines parsing task detailed description construction training parsing network performs task incremental parsing parsing spoken language desirable process input word time words produced speaker incrementally build output representation tight bidirectional coupling parser underlying speech recognition system system parser processes information produced predictive information recognition system based rich representation current context mentioned earlier previous work applying connectionist parsing task promising experiment extends previous work complex sentences significant scale increase parsing task domain experiment sentences clauses including trivial passive constructions sentences tree john mary gave book snake sequential input word time task incrementally build represen tation input sentence includes information phrase structure clause structure semantic role assignment relationships figure shows 
representation desired parse sentence list networks lexical acquisition successfully miikkulainen dyer building large systems makes sense efficiency perspective lexical information network design choice building large systems training contained sentences subset sentences form parser based left associative grammar sentences interesting reflect statistical structure common speech incremental parsing modular recurrent connectionist networks clause clause action patient agent action snake patient relative clause phrase figure representation sentence constructing parser architecture network figure describe detailed network structure bottom constraints numbers objects labels fixed network architecture scalable network modularity architectural constraints exploited minimize training time maximize generalization network constructed separate recurrent subnetworks trained perform portion parsing task training sentences performance full network discussed detail section phrase level types units phrase block units gating units hidden units phrase blocks capture words forming phrase phrase blocks sets units called slots target activation patterns correspond word feature patterns words phrases slot gating unit learns conditionally assign activation pattern feature units word level slot gating units input connections hidden units hidden units input connections feature units gating units phrase block units direct recurrence gating hidden units gating units learn inhibit compete indirect recurrence arising connections phrase blocks hidden units context current input word target activation values gating unit dynamically calculated training gating unit learn active proper time order perform parsing phrase block gating hidden units weights phrase blocks phrase level phrase present position training phrase blocks learn parse clause roles level shared weights separate clause modules level trained simulating sequential building mapping clauses sets units phrase blocks clause figure types 
units level labeling units hidden units labeling units learn label phrases clauses semantic roles phrases phrases clause units assigns role labels agent patient action phrases units indicating modification hidden units connected labeling units provide context competition phrase level input connections phrase blocks single clause training targets labeling units beginning input presentation remain static order minimize global error training units learn active inactive jain waibel input forces network learn predictive clause structure levels trained simultaneously single module types units level mapping labeling hidden units mapping units assign phrase blocks clauses labeling units relative clause clause relationships mapping labeling units connected hidden units input connections phrase blocks phrase level behavior phrase level simulated training module module utilizes weight sharing techniques clause roles level targets labeling mapping units beginning input presentation inducing type predictive behavior parsing performance separately trained single network performs full parsing task additional training needed full parsing network significant differences actual subnetwork perfor mance simulated subnetwork performance training network successfully modeled large diverse training section discusses aspects parsing networks performance dynamic behavior integrated network eralization tolerance noisy input dynamic behavior dynamic behavior network illustrated sentence figure snake sentence training space limitations actual plots network behavior presented small portion network initially units network resting values units phrase blocks activation word unit stimulated causing word feature representation active feature units word level gating unit slot phrase block active feature representation assigned slot gate word presented remaining words sentence processed similarly resulting final phrase level representation shown figure occurring higher levels network processing 
evolving phrase level representation behavior mapping units clause structure level shown figure early presentation word clause structure level phrase blocks belong clause reflecting dominance single clause sentences training assigned phrase block hypothesis revised network embedded clause possibly phrases phrase predictive behavior emerged spontaneously training procedure large majority sentences training beginning embedded clauses phrase words confirm networks expectation word network decide embedded clause phrases incremental parsing modular recurrent connectionist networks snake snake figure clause structure dynamic behavior main clause correct structure sentence confirmed remainder input level relative clause relationship initial hypothesis embedded clause clause roles level processes individual clauses mapped clause structure level labeling units clause initially hypothesize role structure competition role struc ture agent patient units activation traces clause phrase shown figure prediction occurs active constructs passive training final decision role structure embedded clause presented verb phrase immediately role structure dominate network fourth phrase mary expected agent clause role structure predicted clause time prediction generalization type generalization automatic detail word representation scheme omitted previous discussion feature patterns parts part identification part representations john peter differ parts units network learn input connections portions word units network learns jain waibel snake figure clause roles dynamic behavior parse john gave parse peter type generalization extremely addition words network processing sentences explicitly trained network generalizes correctly process distinct ignoring features training weight sharing tech niques phrase clause levels impact difficult measure generalization quantitatively statements made types sentences correctly processed relative training sentences substitution single words resulting 
meaningful sentence exception substitution entire phrases phrases errors structural parsing sentences similar training exemplars network sentences formed composition familiar sentences clauses tolerance noise types noise tolerance interesting analyze word tions poorly articulated short function words variance word speed inter word repetitions effects noise simulated testing parsing network training sentences corrupted ways listed note parser trained wellformed sentences sentences made ungrammatical processed difficulty sentences verb phrases badly corrupted produced reasonable interpretations sentence peter gave received role structure gave supposed gave interpretation corrupted verb phrases context dependent single clause sentences randomly deleted simulate speech recognition errors processed correctly percent time multiple clause sentences degraded similar manner produced parsing errors fewer examples sentence types hurt performance deletion function words beginning phrases produced errors deletions critical function words introducing clauses caused problems incremental parsing modular recurrent connectionist networks network sensitive variations word presentation speed trained constant speed partial phrase repetitions tested network perform sentences networks trained complex parsing tasks possibility weight sharing preventing formation strong attractors training sentences appears tradeoff generalization noise tolerance conclusion presented connectionist network architecture application nontrivial parsing task hierarchical modular recurrent network constructed successfully learned parse complex sentences parser exhibited predictive behavior dynamically hypotheses techniques maximizing generalization discussed network performance sentences impressive results testing sensitivity types noise mixed parser performed sentences sentences function word deletions acknowledgments research funded grants interpreting research national science foundation grant number dave 
touretzky helpful comments discussions references elman finding structure time tech center research language university california diego computation language springerverlag jain connectionist architecture sequential symbolic domains tech school computer science carnegie mellon university jain waibel robust connectionist parsing spoken language proceedings ieee international conference acoustics speech signal processing jordan serial order parallel distributed processing approach tech institute cognitive science university california diego miikkulainen dyer encoding inputoutput representations connectionist cognitive systems touretzky hinton sejnowski proceedings connectionist models summer school morgan kaufmann publishers waibel hanazawa hinton shikano lang phoneme recog nition timedelay neural networks ieee transactions acoustics speech signal processing
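The conditional assignment performed by the gating units at the phrase level can be illustrated with a minimal sketch. It assumes that each slot of a phrase block moves toward the current word's feature pattern in proportion to the activation of its gating unit. The interpolation rule, the function name `gated_slot_update`, and the hand-set gate values are assumptions for illustration only: in the network described above the gates are learned from training data, and the paper's exact update rule is not reproduced here.

```python
import numpy as np


def gated_slot_update(slots, gates, features):
    """Move each slot toward the current feature pattern, scaled by its gate.

    slots:    (n_slots, n_features) current slot activations
    gates:    (n_slots,) gate activations, clipped to [0, 1]
    features: (n_features,) feature pattern of the word being presented
    """
    gates = np.clip(np.asarray(gates, dtype=float), 0.0, 1.0)
    # gate = 1 copies the pattern into the slot; gate = 0 leaves the slot unchanged.
    return slots + gates[:, None] * (features[None, :] - slots)


# Hypothetical feature patterns for a three-word phrase.
word_features = np.array([
    [1.0, 0.0, 0.0, 1.0],
    [0.0, 1.0, 0.0, 1.0],
    [0.0, 0.0, 1.0, 0.0],
])

slots = np.zeros((3, 4))
for t, features in enumerate(word_features):
    gates = np.zeros(3)
    gates[t] = 1.0  # hand-set stand-in for a learned gate: open slot t for word t
    slots = gated_slot_update(slots, gates, features)

print(slots)
```

With one gate open per time step, each word's pattern is captured in its own slot while earlier slots keep their contents, which is the buffer-like behavior the text attributes to phrase blocks. Intermediate gate values would produce partial assignments.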
8 neural network model lightness perception univ william ross boston university boston abstract neural network model lightness perception presented builds theory boundary contour contour system grossberg colleagues early ratio encoding retinal ganglion neurons results constancy background constancy provide functional constraints theory suggest contrast negation hypothesis states ratio measures regions weight determination lightness respective regions simulations model address data lightness perception including ratio hypothesis cross illusion introduction visual experience includes surface color constancy variations scene lighting movement displacement visual contexts color object appears large extent color constancy refers fact surface color remains largely constant intensity composition light reflected eyes object surrounding objects paper discusses neural network model lightness perception black white dimension surface color perception addressed specifically problem background constancy addressed mechanisms accomplish system exhibiting illumination constancy proposed landmark result study lightness experiment reported showed pattern lightness ratio disk annulus independent illumination neural network model lightness perception socalled ratio principle study jects perform brightness matches display paradigm striking result subjects matched increments increments increments results provide psychophysical support notion early visual system codes luminance ratios absolute luminance psychophysical results line results neurophysiology indicating cells early stages visual system encode local luminance contrast note lateral inhibition mechanisms sensitive local ratios part explanation illumination constancy power ratio principle fact early stages visual system code contrast experiments shown general ratios insufficient account surface color perception studies background constancy land role spatial layout illumination ment lightness perception effects argue sufficiency 
local contrast measures cross illusion neural network model presented addresses data fields neurally plausible mechanisms lateral inhibition excitation luminance ratios lightness ratio hypothesis states lightness region determined predominantly relation surfaces equally weighted relations adjacent regions propose determination lightness contrast measures adjacent surfaces partially order preserve background constancy cross pattern input stimulus gray patch cross considered depth cross gray patch depth background cross gray patch cross lighter lightness determined relation black cross patch darker lightness determined relation white background illusion discussed similar terms input stimulus mechanisms presented implement process partial contrast negation initial retinal contrast code modulated depth information retinal contrast consistent depth interpretation maintained retinal contrast supported depth fillingin model lightness models propose initial measures boundary contrast spreading neural activity fillingin compartments produce sponse profile isomorphic percept cohen grossberg grossberg mingolla neumann paper develop neural network model lightness perception theories neural network developed extension boundary contour contour system proposed cohen grossberg grossberg mingolla explain lightness data ross fundamental idea theory lateral inhibition achieves nation constancy requires recovery lightness fillingin diffusion quality lightness case final activities correspond lightness outcome interactions boundaries quality boundaries control process fillingin forming gates variable resistance diffusion visual system construct lightness percepts contrast measures obtained retinotopic lateral inhibition mechanism easily instantiated neural model straightforward modification proposal grossberg fillingin accomplished pathway modulates boundary strength boundaries surfaces objects depth leaky boundaries grossberg stimuli current usage actively increased depth boundaries 
partially contrast effect fillingin proceeds freely preserve lightness constancy figure describes computational stages system input onoff filtering boundaries fillingin figure model components stage contrast measurement stage neural fields lateral inhibitory connectivity measure strength contrast image gions uniform regions contrast measurement results formally field constants total excitatory input total inhibitory input terms denote discrete input gaussian weighting functions kernels analogous equation specifies field figure shows minus stage boundary detection stage oriented bound detection cells excited oriented sampling stage cells responses maximal activation strong side cells receptive field activation strong opposite side words cells tuned onoff contrast cooccurrence output stage activations cells location orientations output responses localized lateral inhibition space equation similar equation final output stage signals boundaries stage depth current implementation simple scheme determination depth configuration initially types neural network model lightness perception cells detect configurations image constant detects left positions boundary stage active similar cells detect orientations activities cells conjunction boundary signals define complete boundaries fillingin depth boundaries results depth depth stage fillingin stage contrast measures allowed diffuse space respective fillingin regions fusion blocked boundary activations stage grossberg details diffusion process modulated depth information depth information activities code depths full implementation model depth information obtained depth segmentation image supported binocular disparity monocular depth cues fillingin boundaries depths reduced strength small percentage contrast side bound leak resulting partial contrast negation reduction boundaries fillingin domains receive contrast activities stage inputs filledin simulations present model account important phenomena including effects lightness 
constancy contrast grossberg simulations follow address lightness effects cross figure shows simulation cross plotted gray level values fillingin reflect activities fillingin domain minus domain model correctly predicts patch cross appears lighter patch background result direct consequence contrast negation depth relationships patch cross depth cross patch background depth background depth ratio background patch cross depth boundary ratio cross patch background depth boundary smaller weight lightness computation background stronger effect appearance patch background darker time cross greater effect appearance patch cross lighter illusion illusion gray patches black stripes lighter gray patches white stripes effect considered violation simultaneous contrast contour length gray patches larger stripes simultaneous contrast predict gray patches black stripes lighter white ross boundaries depth stimulus onoff contrast filledin figure cross filledin values gray patch cross higher gray patch background gray levels code intensity darker code lower values lighter code higher values figure shows result model effect infor mation stimulus determines gray patches patches appearance determined relation contrast respective obtained modulation contrast gray patch black stripe preserved contrast patch white partially depth arrangement hypothesis showed perception lightness determined retinal depth configuration spatial layout lightness specifically proposed ratio surfaces necessarily adjacent determines lightness socalled ratio hypothesis demonstrate comparing perception lightness equivalent displays terms luminance values perceived depth relationships displays figure shows computer simulations ratio effect stimulus input simulations depth specifications depth depth specifies rightmost patch depth leftmost patches depth rightmost patches depth leftmost patch depth organization lightness central region darker configuration depth depth depth middle patch white patch patch 
simultaneous contrast depth middle patch contrast black patch noted depth maps simulations shown input neural network model lightness perception stimulus boundaries depth filledin figure effect filledin values gray patches black stripes higher gray patches white stripes current implementation recover depth binocular disparity employs monocular cues previous simulations conclusions paper data experiments lightness perception extend theory grossberg colleagues account challenging phenomena model initial step providing account consideration complex factors involved vision grossberg comprehensive account vision acknowledgements authors alan suggestions work supported part force office scientific research afosr office naval research supported part references lightness brightness brightness contrast reflectance variation perception psychophysics cohen grossberg neural dynamics brightness perception features boundaries diffusion resonance perception psychophysics simultaneous contrast fillingin process formation processing visual system experimental brain research ross depth filledin stimulus figure higher bottom depth filledin filledin values middle patch perceived lightness depends perceived spatial arrangement science grossberg vision figureground separation visual cortex psychophysics grossberg mingolla neural dynamics form perception boundary completion illusory figures color spreading psychological review grossberg neural dynamics brightness perception unified model classical recent phenomena perception psychophysics land lightness theory journal optical society america mingolla neumann contrast multiscale network model brightness perception vision research visual adaptation retinal gain controls progress retinal research oxford press brightness constancy nature colors jour experimental psychology white effect pattern perceived lightness perception effect background luminance brightness vision research
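The filling-in stage of the lightness model above (contrast activity diffusing within a region, blocked by boundary signals, with weakened boundaries letting a fraction of contrast leak across) can be illustrated with a minimal one-dimensional sketch. This is not the authors' implementation: the function name `fill_in`, the permeability form `delta / (1 + eps * (B_i + B_j))`, and all parameter values are assumptions chosen for illustration, loosely following the boundary-gated diffusion of the Grossberg filling-in literature cited in the paper.

```python
import numpy as np

def fill_in(contrast, boundary, decay=0.01, delta=1.0, eps=100.0,
            dt=0.2, steps=5000):
    """1-D boundary-gated diffusion (hypothetical sketch, not the authors' code).

    contrast : input contrast measure per cell (drives the filling-in)
    boundary : boundary strength per cell (closes the diffusion gates)
    """
    contrast = np.asarray(contrast, dtype=float)
    boundary = np.asarray(boundary, dtype=float)
    # Permeability of the link between cell i and cell i+1: a strong
    # boundary signal at either cell raises the resistance to diffusion.
    perm = delta / (1.0 + eps * (boundary[:-1] + boundary[1:]))
    s = np.zeros_like(contrast)
    for _ in range(steps):
        flow = perm * (s[1:] - s[:-1])   # net flow from cell i+1 into cell i
        ds = -decay * s + contrast       # passive decay plus contrast input
        ds[:-1] += flow
        ds[1:] -= flow
        s = s + dt * ds                  # explicit Euler step
    return s
```

Calling `fill_in` once with a full-strength boundary and once with the same boundary scaled down (for example by 0.2) shows the qualitative effect described in the text: the reduced boundary lets more activity leak into the neighbouring compartment, which is the sketch's analogue of partial contrast negation.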
3 translating paul munro mary department information science university pittsburgh pittsburgh abstract network trained back propagation expressions form semantic representation munro networks performance analyzed simulations training sets english german translation attempted presenting expression network trained language generate semantic representation semantic representation presented network trained language generate introduction connectionist approaches success relative competing accounting context sensitivity attractive approach figure forward munro expressions form representation spatial relationship nouns features spatial representations network trained generalized delta rule rumelhart hinton williams patterns components syntactic semantic syntactic components pair nouns separated semantic component representation spatial relationship architecture network includes encoder banks inspired hinton force development distributed representations nouns enhance performance network facilitate analysis networks function important component theory role nouns ideal meaning networks trained perform task components pattern selected training presented input layer either component missing task provide components output analysis network learning phase consists several tests presenting accompanying nouns order tain ideal meaning comparing noun representations encoder banks noun table book school glass flowers plane road house city floor water room chip fish spatial relation touching edge embedded figure network architecture inputs presented lowest layer input banks input banks bold lines indicate connectivity units lower bank units upper bank units represent patterns listed table methodology training sets pattern combinations formed nouns meaningful expressions chosen constitute english training corpus phrase units chosen represent position nouns relative nouns generate german corpus picked german describe spatial representation nouns training consists spatial pairs
nouns correspondences languages training sets table munro table number correspondences english german training sets translation transforming syntactic expressions semantic representations inverting process language approach machine translation work paper wellsuited approach form transformation direction encoding decoding networks trained expressions languages attached sequence accomplish translation task syntactic triple source language presented network trained language resulting output presented nouns target language input network trained target language yielding target language output procedure assumed relative nouns easy translate translation nouns assumed dependent context translation expression house illustrated figure results networks trained procedure english language inputs german learning rates language random number generator case tests performed trained network order determine ideal networks classification nouns interaction nouns translation english german attempted test modes detail translating figure schematic view translation procedure training networks languages appropriately translated language performing task source language encoding task target language figure shows resulting activity patterns expression house system correctly translates english german contexts correspond german convergence case networks converged states average error case network learn respond correctly phrase train performance training measured computing total squared error output units training patterns errors types errors errors encoding errors decoding errors errors output units input errors output units input errors output units input errors output units input munro assessing performance network learning error measure driving training difference desired actual activity levels output unit inappropriate cases output units trained binary values informative compare relative activity output units desired pattern simply count number inputs wrong approach determine phrase 
processed correctly incor rectly network output errors counted identifying highly activated output unit checking matched correct number active units component training pattern varies response incorrect units active results reported table total errors training corpus table number errors task simulation ideal meanings find unmodified spatial representation associates presented individually resulting spatial responses recorded contextfree interpretation figure shows output activity spatial units simulation language results similar simulations language demonstrating network finds fairly stable representations note representations german share activation english distribution varies acti english units indicating object edge units found weakly activated german unit indicating coincidence ideal meaning tween english translating figure ideal meanings translation made translations training corpus english german german english performance network training corpus shown table maximum number phrases translated incor rectly percent correct minimum wrong percent correct fact english networks learned training corpus german networks generating semantic description nouns shows translation task translations consistently table number phrases translated simi number munro discussion highly constrained limited demonstration simulations formed databases illustrate connectionist networks capture struc tures languages interact approach machine translation shown promise practical systems based traditional linguistic theory network presented paper supports approach connection framework feasible construct space repre semantics adequately limited domain concrete representation arbitrary semantics story hand semantic representations components system event system bidirectional mappings syntax semantics examples extend learning expressions candidate machine translation general investigation anticipate extensive application back propagation neural work algorithm involve processing temporal patterns 
keeping dynamic representation semantic hypotheses temporal scheme proposed elman acknowledgements research supported part grant author international computer science institute provided author financial support stimulating research environment summer references munro learning represent understand prepositional phrases conf cognitive science society elman finding structure time center research language university california diego language spatial cognition cambridge university press cambridge hinton geoffrey learning distributed representations concepts conf cognitive science society rumelhart hinton williams learning internal representations error propagation parallel distributed processing explorations microstructure cognition rumelhart mcclelland cambridge survey machine translation history current status future computational linguistics
1 network image segmentation color hurlbert poggio center biological information processing college department brain cognitive science laboratory cambridge abstract propose parallel network simple processors find color boundaries spatial nation spread uniform colors marked gions introduction rely color recognizing objects visual system approximate color constancy characteristics object lights step color recog nition segmenting scene regions colors require color constancy crucial step color serves simply means distinguishing object scene color differences mark material boundaries essential absolute color values goal tation algorithms achieve step object recognition finding discontinuities image mark material boundaries problems segmentation algorithms solve choose color distinguish material boundaries image give rise color edges fill uniform regions color labels ideally color labels remain constant illumination scene composition color edges occur material boundaries rubin show algorithms solve problem conditions comparing image signal distinct spectral channels side edge goal segmentation algorithms discuss find boundaries tween regions surface spectral spread uniform colors explicitly requiring colors constant illumination color labels analogous coordinates single source assumption change space hurlbert poggio surface spectral reflectance strong ties present algorithms require stage identify color label explicitly incorporated color edges luminance edges analogy psychophysics segmentation fillingin koffka ring illusion color attributed surfaces inter action operator fillingin operator interaction justified fact real world surface spectral reflectance accompanied brightness color labels assume surfaces reflect light model model image components surface reflection body reflection labels wavelength point surface image coordinates correspond illumination surface spectral reflectance factor body reflection component magnitude depends viewing geometry parameters spectral 
reflectance factor surface reflection component assumed constant respect true materials materials magnitude component depends strongly viewing geometry single source assumption factor illumination separate spatial spectral components multiplying spectral sensitivities color sensors integrating wavelength yields color values reflectance factors spectral channels defined sensor spectral sensitivities define note original algorithm thresholds sums differences image adjacent points paths accounts contribution edges color introducing separate luminance edge detector network image segmentation color pixel reflection reflectance factor case piecewise constant change image change mark discontinuities surface spectral reflectance function mark material boundaries conversely image regions constant correspond regions constant surface color synthetic images generated standard computer graphics algorithms reflectance model behave constant visible surface shaded sphere general approach segmentation problem find regions constant boundaries difficulty approach real data noisy unreliable quotient numbers noisy biological spectral sensitivities close goals segmentation algorithms enhance discontinuities regions marked discontinuities noise fill data unreliable explored methods meeting goals segmentation algorithms method eliminate noise fill data preserving discontinuities algorithm based markov random field techniques obtained encouraging results real images poggio technique exploits constraint piecewise constant discontinuity contours image brightness edges finding contours alternative approach cooperative network data filters noise enforcing constraint piecewise constancy network type hopfield similar cooperative stereo network marr poggio approach consists winnertakeall scheme algorithms involve loading initial values discrete bins undesirable biologically feature produce good results noisy synthetic images improved modification hurlbert class algorithms describe simple 
effective parallel computers connection machine averaging network avoid small step uniform surface resulting initial loading discrete bins relax local requirement piecewise hurlbert poggio figure image sphere channel vertical slice pixel region image sphere constancy require smooth regions edge input local smoothness requirement yields iterative algorithm asymptotically piecewise constant regions implement local smoothness criterion averaging scheme simply replaces pixel image average local surround iterating times image algorithm takes input image edge images luminance edges luminance edges edges edges edges edge images obtained performing edge detection thresholded directional derivative iteration pixel image replaced average contributing neighborhood neighboring pixel allowed contribute pixels sharing full border central pixel shares edge label central pixel input edge images nonzero fixed range central pixel requirement simply edge label requirement image serves input edge image edge label requirement pixels side edge averaged pixels similar averaged formally network image segmentation color pixels neighbors differ amount crossed edge edge maps assumption pixel belong edge iteration operator similar nonlinear diffusion discontinuous regularization type discussed blake geman geman marroquin iterative scheme equation derived minimization gradient descent energy function quadratic potential constant local averaging noise values spreads uniform regions marked edge inputs images shading strong algorithm performs clean segmentation regions conclusions averaging scheme finds constant regions assumptions single source strong strong highlight originate edge break averaging operation limited experience spec average disappear smoothed largely strong image reduced initial image iterative averaging scheme completely eliminates remaining powerful discrimination require specialized routines higherlevel knowledge hurlbert simple network sufficient reproduce psychophysical 
interaction brightness color edges enables network mimic visual koffka ring replicate illusion koffka ring uniform grey rectangular background side black white hurlbert poggio filtered lightness filter estimated hurlbert poggio figure pixel region image including image obtained iterations averaging network edge input edges luminance image threshold differences image similar values averaged vertical slice center vertical slice coordinates note scales network image segmentation color hurlbert poggio images step replaces operation obtaining cases goal eliminate spatial gradients effective illumination filtered koffka ring averaging network brightness edges input image boundary parts background continues annulus output image iterations averaging work annulus splits colors output image dark grey white half light grey black half hurlbert boundary continue annulus annulus remains uniform grey results agree human perception acknowledgements report describes research center biological information processing department brain cognitive sciences intelligence laboratory research sponsored grant office naval research cognitive neural sciences division artificial intelligence center hughes aircraft corporation sloan national science foundation artificial intelligence center hughes aircraft corporation nato scientific vision support artificial intelligence research provided advanced research projects agency department defense army contract part contract poggio supported massachusetts institute technology college references john rubin vision representing material cate artificial intelligence laboratory memo massachusetts institute technology method computing spec ular highlights journal optical society america steven color separate reflection components color research applications poggio geiger weinshall yang hurlbert vision machine proceedings image understanding workshop cambridge april morgan kaufmann mateo david marr poggio cooperative computation stereo disparity ence hurlbert 
computation color thesis massachusetts institute technology cambridge jose marroquin probabilistic solution inverse problems thesis institute technology cambridge andrew blake andrew visual reconstruction press cambridge mass stuart geman geman stochastic relaxation gibbs distributions bayesian restoration images ieee transactions pattern analysis machine intelligence hurlbert poggio learning color algorithm examples dana anderson editor neural information processing systems american institute physics hurlbert poggio color algorithm examples science
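The averaging network described in the segmentation paper above (each pixel replaced by the average of its contributing neighborhood, where a neighbor contributes only if it shares a full border with the central pixel, carries the same edge label, is nonzero, and lies within a fixed range of the central value) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the name `average_step`, the single combined edge map, the inclusion of the central pixel in its own average, and the threshold `max_diff` are all hypothetical choices.

```python
import numpy as np

def average_step(img, edges, max_diff):
    """One iteration of edge-gated local averaging (hypothetical sketch).

    img      : 2-D array of hue-like values
    edges    : 2-D integer edge map (same shape); a neighbor contributes
               only if its edge label equals that of the central pixel
    max_diff : neighbors differing by more than this are excluded
    """
    h, w = img.shape
    out = img.astype(float).copy()
    for y in range(h):
        for x in range(w):
            total, count = float(img[y, x]), 1   # central pixel included (assumption)
            # Only the four neighbors sharing a full border are considered.
            for dy, dx in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                ny, nx = y + dy, x + dx
                if not (0 <= ny < h and 0 <= nx < w):
                    continue
                if edges[ny, nx] != edges[y, x]:
                    continue                      # different edge label
                if img[ny, nx] == 0:
                    continue                      # zero values are treated as missing
                if abs(img[ny, nx] - img[y, x]) > max_diff:
                    continue                      # too dissimilar to average
                total += img[ny, nx]
                count += 1
            out[y, x] = total / count
    return out
```

Iterating `average_step` many times smooths noise inside each region while the edge map and the similarity threshold keep values from mixing across a boundary, which is the behaviour the text attributes to the scheme.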
9 dual kalman filtering methods nonlinear prediction smoothing estimation eric alex nelson department electrical engineering oregon graduate institute portland abstract prediction estimation smoothing fundamental signal processing perform tasks noisy data form time series model process generates data taking noise system explicitly account maximum likelihood kalman discussed involve dual process estimating model parameters state system review established meth linear case propose extensions utilizing dual kalman filters forwardbackward filters applicable neural networks methods compared simulations noisy time series include nonlinear noise reduction speech introduction general autoregressive model noisy time series process additive observation noise corresponds true underlying time series driven process noise nonlinear function past parameterized nelson observation additional additive noise prediction refers estimating past observations purposes paper restrict univariate time series estimation determined observations including time finally smoothing refers estimating observations past future minimum square nonlinear prediction conditional expectation time series directly data generate approximation optimal predictor generally case common approach noisy data directly leading approximation results biased predictor reduce bias predictor exploiting knowledge observations measurements arising time series estimates found estimation smoothing estimates form predictor approximates remainder paper develop methods dual estimation states show maximumlikelihood framework relate existing algorithms established linear methods extended nonlinear framework methods involving dual kalman filters proposed experiments provided compare results dual estimation noisy observations dual estimation problem requires tion standard prediction output errors observation input errors minimum prediction error error variance equals noise variance correlated observation error assuming errors gaussian 
minimum variance construct loglikelihood function proportional vector errors time minimization loglikelihood function leads maximumlikelihood esti estimate noise variances assume paper general optimization models trained estimated data important estimated data prediction training online data words model formed approximation provide input order avoid model mismatch dual kalman filtering methods methods method statistics literature nonlinear regression wild involves batch optimization cost function equation minor modifications made account time series model methods memory intensive approx accommodate data efficient manner retraining data order produce estimates data points ignore cross correlation prediction observation error diagonal matrix cost function expressed simply equivalent cost function weigend developed heuristic method cleaning inputs neural network modelling problems stochastic optimization assumption time series formulation lead severely biased results note estimate provided point model linear reduces standard batch weighted squares procedure solved closed form generate maximumlikelihood estimate noise free time series linear model unknown problem complicated product parameter vector vector bilinear relationship unknown quantities solving requires knowledge solving requires iterative methods solve nonlin optimization batch method typically employed method nonlinear models readily developed computational expense makes practical context neural networks kalman methods kalman methods involve reformulation problem statespace framework order efficiently optimize cost function recursive manner time point optimal estimation achieved combining prior prediction observation connor proposed extended kalman filter neural network perform state estimation posed weight estimation statespace framework kalman training neural network extend ideas include dual kalman estimation states weights efficient maximumlikelihood optimization introduce forwardbackward information 
filters relationships methods statespace formulation equations nelson model linear takes form written controllable canonical form model linear parameters kalman filter algorithm readily estimate states lewis time step filter computes linear squares prediction error covariances linear case gaussian statistics estimates minimum square estimates prior information reduce maximumlikelihood estimates note kalman filter maximumlikelihood instant time past data approach batch method smoothed estimate data estimates final time step match exact equivalence time achieved combining kalman filter backwards information filter produce forwardbackward smoothing filter lewis effectively inverse variance propagated backwards time form backwards state estimates combined forward estimates data large filter offers significant computational advantages batch form model nonlinear kalman filter applied directly requires linearization nonlinear model time step resulting algorithm extended kalman filter effectively approxi nonlinear function timevarying linear batch iteration unknown models linear model unknown bilinear relationship time series estimates weight requires iterative optimization approach referred kalman filter estimate fixed leastsquares optimization current specifically parameters estimated matrix state estimates vector observations nonlinear models feedforward neural network approximate replace procedures backpropagation extended kalman filter referred connor disadvantage approach slow convergence keeping inaccurate estimates fixed batch optimization stage dual kalman filter approach unknown models joint state vector model time series estimated simultaneously applying nonlinear joint state equations linear case algorithm convergence problems alternative construct separate statespace formulation underlying weights slight modification cost equation account initial conditions kalman form dual kalman filtering methods state transition simply identity matrix plays role timevarying 
nonlinear observation unknown model linear observation takes pair dual kalman filters parallel state estimation weight estimation nelson time step current estimates dual approach essentially separate nonlinear optimization linear assumptions remain uncorrelated statistics remain gaussian note error filter accounted developed approaches address coupling present sake brevity equation short write variance noise replace equation estimation note ability couple statistics manner batch approaches extend method nonlinear neural network models dual extended kalman filtering method simply requires neural network computed filters time step note feeding network implicitly recurrent network forwardbackward methods kalman methods reformulated forwardbackward kalman filtering improve state smoothing dual kalman methods require forward backward state estimates order generate smooth update time step addition estimates requires nature lead biased specifically weights computed matrix smooth state estimates equivalent adjustments made dual kalman methods model system required nonlinear case results algorithms future publication experiments table compares approaches linear time series linear model unknown square estima tion weights bottom represents baseline performance noise model training predictions interpreted carefully training data optimize weights methods perform training recall issue kalman methods online test kalman filters continue operate weight estimates fixed forwardbackward method improves performance methods cost function state weight estimation improved prediction resulting test performance significantly worse time series compare nonlinear methods results summarized table conclusions parallel linear case note method performed baseline provided standard backprop nelson table comparison methods linear models model train test train test model unknown values estimation prediction weights normalized training samples testing signal model samples computed unknown model memory 
training testing constraints model table comparison methods nonlinear time series train test train test train test series generated autoregressive neural networks exhibit limit training samples testing cycle chaotic behavior samples network models inputs hidden units crossvalidation methods agation model noise method exhibited fast convergence requiring epochs training method development tested speech signal corrupted simulated bursting white noise figure method applied successive point windows signal window starting points results figure computed assuming average improved experiment estimated noisy signal nelson improvement comparison stateoftheart techniques spectral subtraction processing achieve improvements extend algorithms colored noise case paper nelson conclusions methods kalman framework dual estima tion states weights noisy time series methods utilize dual kalman filtering methods clean speech noise noisy speech cleaned speech figure cleaning noisy speech shown process observation noise models improve estimation performance work progress includes extensions colored noise blind signal separation forward backward filtering noise estimation study needed dual extended kalman filter methods neural network prediction estimation smoothing offer potentially powerful tools signal processing applications work sponsored part grant grant references suppression acoustic noise speech spectral subtraction ieee april connor martin atlas recurrent neural networks robust time series prediction ieee neural networks march lewis optimal estimation john wiley sons york adaptive filtering prediction control prenticehall englewood cliffs speech enhancement based temporal processing icassp proceedings nelson neural speech enhancement dual extended kalman submitted nelson simultaneous online estimation parameters states linear systems ieee automatic control february neural control nonlinear dynamic systems kalman filter trained recurrent networks ieee wild nonlinear regression 
john wiley sons weigend university colorado computer science technical report
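As a concrete companion to the state-space formulation in the Kalman paper above, the sketch below implements only the state-estimation half of the problem: a standard Kalman filter for a linear autoregressive model with known weights, written in the canonical (companion-matrix) form, with process noise driving the first state component and additive observation noise. It is not the dual filter itself and not the authors' code; the function name `kalman_ar` and the arguments `w`, `q`, `r` (weights, process-noise variance, observation-noise variance) are assumptions made for illustration.

```python
import numpy as np

def kalman_ar(y, w, q, r):
    """Kalman filter for x_k = w . [x_{k-1}, ..., x_{k-p}] + v_k, y_k = x_k + n_k.

    Hypothetical sketch: weights w and noise variances q (process) and
    r (observation) are assumed known. Returns the filtered estimates of x_k.
    """
    y = np.asarray(y, dtype=float)
    w = np.asarray(w, dtype=float)
    p = len(w)
    # Companion matrix: first row holds the AR weights, the rest shifts the state.
    F = np.zeros((p, p))
    F[0, :] = w
    F[1:, :-1] = np.eye(p - 1)
    x = np.zeros(p)          # state estimate
    P = np.eye(p)            # state error covariance
    est = np.empty(len(y))
    for k, yk in enumerate(y):
        # Time update (prediction).
        x = F @ x
        P = F @ P @ F.T
        P[0, 0] += q         # process noise enters the first component only
        # Measurement update; the observation picks out the first component.
        gain = P[:, 0] / (P[0, 0] + r)
        x = x + gain * (yk - x[0])
        P = P - np.outer(gain, P[0, :])
        est[k] = x[0]
    return est
```

In the dual setting described in the text, a second filter of the same form would run in parallel on the weights, with an identity state transition and the current state estimates acting as a time-varying observation vector; that part is deliberately left out here.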
10 rectified gaussian distribution seung bell laboratories technologies murray hill abstract simple powerful modification standard gaussian tribution studied variables rectified gaussian constrained nonnegative enabling nonconvex functions multimodal examples competitive cooperative distributions illustrate power rectified gaussian cooperative distribution translations pattern demonstrates potential rectified gaussian modeling pattern manifolds introduction rectified gaussian distribution modification standard gaussian variables constrained nonnegative simple modification brings increased representational power illustrated multimodal examples rectified gaussian competitive cooperative distributions modes competitive distribution regions probabil modes cooperative distribution closely spaced nonlinear continuous manifold distribution accurately approximated single standard gaussian short rectified gaussian represent discrete continuous variability standard gaussian increased representational power price increased complexity finding mode standard gaussian involves solution linear equations finding modes rectified gaussian quadratic programming problem sampling standard gaussian generating dimensional normal linear transformation sampling rectified gaussian requires monte carlo methods sampling algorithms basic tools important probabilistic modeling boltzmann rectified gaussian undirected graphical model rectified gaussian representation probabilistic modeling rectified gaussian distribution figure types quadratic energy functions saddle continuousvalued data unclear learning tractable rectified gaussian boltzmann machine version rectified gaussian recently introduced hinton version single variable singularity origin designed produce sparse activity directed graphical models version lacks singularity interesting case variable relies undirected interactions variables produce multimodal behavior interest present work inspired biological neural network models contin uous 
dynamical energy function cooperative distribution previously studied models visual motor head direction energy functions saddle standard gaussian distribution defined symmetric matrix vector define quadratic energy function parameter inverse temperature lowering temperature concentrates distribution minimum energy function normalizes integral unity depending matrix quadratic energy function types curvature energy function shown figure convex minimum energy corresponds peak distribution distribution pattern recognition applications patterns single prototype corrupted random noise energy function shown figure direction patterns generated distribution roughly equal likelihood direction corresponds invariances pattern principal component analysis thought procedure learning distributions form energy function shown figure gaussian distribution energy decreases limit seung sides saddle leading distribution energy functions rectified gaussian distribution defined vectors components nonnegative class energy functions matrix property condition note matrices larger positive definite matrices standard gaussian constraints block directions energy diverges negative infinity concrete examples discussed shortly energy functions examples multiple minima distribution multimodal standard gaussian defining distributions introduce tools modes rectified gaussian minima energy function subject constraints temperatures modes distribution characterize behavior finding modes rectified gaussian problem quadratic programming algorithms quadratic programming simple case constraints simplest algorithm projected gradient method discrete time dynamics consisting gradient step rectification nonnegative step size chosen correctly algorithm provably shown converge stationary point energy practice stationary point generally local minimum neural networks solve quadratic programming problems define synaptic weight matrix continuous time dynamics initial condition nonnegative dynamics remains
nonnegative quadratic function lyapunov function dynamics methods converge stationary point energy gradient energy conditions stationary point satisfy conditions intuitive explanation interior constraint region gradient vanish boundary gradient point interior stationary point local minimum conditions augmented condition hessian nonzero variables positive definite methods guaranteed find global minimum case positive definite energy function convex convex energy function unique minimum convex quadratic programming solvable polynomial time contrast nonconvex energy function generally find global minimum polynomial time presence local minima practical situations difficult find reasonable solution rectified gaussian distribution figure competitive distribution variables nonconvex energy function constrained minima axes shown contours constant energy arrows represent negative gradient energy rectified gaussian distribution peaks rectified gaussian interesting nonconvex case possibility multiple minima consequence multiple minima multimodal distribution standard gaussian examples multimodal rectified gaussian competitive distribution competitive distribution defined simple case energy function constrained minima shown figure lead distribution constraints imposed constrained minima nonconvex energy function correspond peaks distribution bimodal distribution approximated mixture standard gaussians single gaussian distribution approximate distribution reduced probability density peaks single gaussian competitive distribution energy function similar govern winnertakeall large global minima energy function vectors component equal unity rest competitive interaction components temperature distribution eigenvalues covariance seung figure competitive distribution variables mode temperature state distribution strong competition variables results variable modes form winner variable sample finite temperature monte carlo sampling clear winner variable sample standard gaussian
matched covariance negative values sample bears resemblance states shown clear winner variable equal single mode mode vector eigenvectors span dimensional space perpendicular figure shows samples drawn finite temperature competitive distribution drawn standard gaussian distribution covariance sample standard gaussian negative values sample original distribution importantly standard gaussian capture strongly competitive character distribution cooperative distribution define cooperative distribution variables angle variable variables regarded ring energy function defined coupling depends separation ring minima ground states energy function found numerically methods earlier analytic calculation ground states large limit shown figure ground state activity centered angle ring pattern activity modes competitive distribution arises cooperative interactions neurons ring distribution invariant rotations ring cyclic permutations variables ground states angle covariance cooperative distribution const sample shown figure completely uniform samples generated gaussian distribution rectified gaussian distribution figure cooperative distribution variables temperature state cooperative interaction variables leads pattern activity locations ring finite temperature sample sample standard gaussian matched covariance covariance completely ground states cooperative distribution deviations standard gaussian behavior reflect fundamental differences underlying energy function energy function discrete minima arranged ring limit large minima small reasonable approximation regard energy function continuous line minima ring words energy surface curved similar bottom centroid ring close minimum cooperative distribution model translations pattern activity suggests rectified gaussian invariant object recognition cases continuous manifold instantiations object modeled case visual object recognition images object viewpoints form continuous manifold sampling figures depict samples drawn competitive
cooperative distribution samples generated metropolis monte carlo algorithm full descriptions algorithm found give description features basic procedure generate configuration system calculate change energy energy decreases accepts configuration increases configuration accepted probability sampling algorithm variable updated time analogous single spin flips acceptance ratio higher update spins simultaneously distributions energy function approximately marginal directions directions barrier cooperative distribution property expect critical slowing sort collective update analogous updates cluster updates make sampling efficient type update depend energy function easy determine seung discussion competitive cooperative distributions examples rectified gaussians good approximation standard gaussian distributions approximated mixtures standard gaussians competitive distribution approximated mixture gaussians state cooperative distribution approximated mixture gaussians location ring approximation reduce number gaussians mixture make rectified gaussian superior mixture models empirical question investigated empirically specific realworld modeling tasks intuition rectified gaussian turn good representation nonlinear pattern manifolds paper make intuition concrete make rectified gaussian practical applications critical find tractable learning algorithms clear learning tractable rectified gaussian boltzmann machine continuous variables rectified gaussian easier work binary variables boltzmann machine acknowledgments saul sompolinsky helpful discussions work project supported bell laboratories technologies references ackley hinton sejnowski learning algorithm boltzmann machines cognitive science hinton ghahramani generative models discovering sparse distributed representations phil trans ghahramani hinton hierarchical nonlinear factor analysis topographic maps neural info proc syst seung brain eyes proc natl acad sompolinsky theory orientation tuning visual cortex proc acad
georgopoulos cognitive neurophysiology motor cortex science zhang representation spatial orientation intrinsic dynamics headdirection cell ensemble theory neurosci bertsekas nonlinear programming scientific belmont amari arbib competition cooperation neural nets editor systems neuroscience pages academic press york hinton dayan revow modeling manifolds images handwritten digits ieee trans neural networks
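The projected gradient method that the rectified Gaussian paper above describes for finding modes (a gradient step followed by rectification) can be sketched as follows. This is a minimal illustration under stated assumptions, not the author's implementation: it assumes a quadratic energy of the form E(x) = 0.5 x^T A x - b^T x minimized over nonnegative x, and the matrix `A`, vector `b`, step size `eta`, and all function names are illustrative choices picked so that the constrained minima lie on the axes, as in the competitive example.

```python
# Projected gradient sketch for minimizing E(x) = 0.5 * x^T A x - b^T x over x >= 0.
# A, b, eta and the function names are illustrative assumptions, not taken from the source.

def gradient(A, b, x):
    """Gradient of the quadratic energy: A x - b."""
    return [sum(A[i][j] * x[j] for j in range(len(x))) - b[i] for i in range(len(x))]

def projected_gradient(A, b, x0, eta=0.1, steps=500):
    """Alternate a gradient step with rectification max(0, .)."""
    x = list(x0)
    for _ in range(steps):
        g = gradient(A, b, x)
        x = [max(0.0, xi - eta * gi) for xi, gi in zip(x, g)]
    return x

# Nonconvex example: A has eigenvalues 3 and -1, but all entries are positive,
# so the energy is bounded below on the nonnegative orthant.
A = [[1.0, 2.0], [2.0, 1.0]]
b = [1.0, 1.0]

mode_a = projected_gradient(A, b, [0.6, 0.4])  # starts closer to the first axis
mode_b = projected_gradient(A, b, [0.4, 0.6])  # starts closer to the second axis
print(mode_a, mode_b)
```

Starting points on opposite sides of the diagonal end at different constrained minima, one on each axis, which is the kind of bimodality the text attributes to the competitive case.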
1 learning choice internal grossman meir domany department electronics institute science israel abstract introduce learning algorithm multilayer neural networks composed binary linear threshold elements algorithms reduce learning process minimizing cost function weights method treats internal representations fundamental entities determined correct internal representations arrived weights found local biologically plausible perceptron learning rule tested learning algorithm problems symmetry parity combined introduction network binary linear threshold elements state determined rule unidirectional weight assigned connection unit local bias focus attention feedforward networks units layer determine states units hidden layer turn feed output elements typical task network single output input layer state belongs category input space basic problem learning find algorithm produces weights enable network perform task absence hidden units learning accomplished rosenblatt briefly describe source units single target unit source units patterns require target unit determined takes values learning takes place training session starting arbitrary initial guess weights input presented resulting output taking modify weight rule grossman meir domany parameter modify bias input pattern presented inputs draw correct output perceptron convergence states rosenblatt minsky find solution exists finite number steps partitions input space small subset linearly separable lewis single layer perceptrons hidden units added single hidden layer large number units inserted input output classification problem solution architectures implemented network clear connection error corrective action backpropagation problem dealing networks continuous valued units response function continuous sigmoid learning consists gradientdescent type minimization cost function measure deviation actual outputs required space weights version back propagation desired states bears similarity algorithm recently introduced plaut
widrow winter related methods algorithm views internal representations inputs basic independent variables learning process conceptually plausible assumption learning biological artificial system form maps representations external world representations formed weights found simple local hebbian learning rules problem learning searching proper internal representations minimization failure converge solution indication current guess internal representations modified algorithm internal representations states hidden layer patterns training presented weights found problem learning choosing proper internal representations minimizing cost function varying values weights demonstrate classification problem output values required response input patterns solution found maps input internal representation generated hidden layer turn produces correct output imagine supplied weights solve problem correct internal representations revealed table rows input bits state hidden layer obtained response input pattern view hiddenlayer cell target cell inputs viewed source sufficient time converge learning choice internal representations weights connecting input unit hidden unit input output association appears column table realized similar fashion yield weights learning process hidden layer source output unit target order solve learning search procedure space internal representations table generate solution updating weights parallel layers current table internal representations present algorithm process broken distinct stages generate table internal representations presenting input pattern training calculating state hidden existing couplings hidden layer cells source output target unit current table internal representations training find weights obtain desired outputs solution found problem solved stop learning sweeps current weights generate table internal representations yields correct outputs presenting table sequentially hidden layer wrong output obtained internal representation changed
wrong output means field produced hidden layer output unit large small randomly pick site hidden layer flip sign direction replace entry table picking sites changing internal representation pattern correct output generated generate correct output provided case learning process procedure ends modified table guess internal representations apply layer serving source treating hidden layer site separately target input training presented layer check correct result produced unit network wrong output hidden unit modifying weights incident column table desired states unit input yield correct output insert current state hidden layer internal representation pattern learning steps sweep manner training modifying weights input hidden layer hiddenlayer thresholds explained internal grossman meir domany representations network achieved performance entire training learning completed solution found sweeps training stage present values start fairly complete account procedure grossman added parameters arbitrary introduced guarantee stage solution found clear solution exists weights current table internal representations stage converge time limit table internal representations formed parameters large find solution exists sufficiently high probability hand large values force algorithm execute long search solution exists values parameters determined optimizing performance network experience reasonable range values found performance fairly insensitive precise choice integer weights correction step size constant binary units scaled unity setting integer loss generality optimization algorithm parameters optimized performance parameters section time limit upper bound total number training sweeps training parameters increment weights thresholds stage values weights thresholds initial random values weights interval thresholds integer weights program parameters treating multiple outputs version internal representations find yields correct output error pattern output unit prespecified number 
attempted flips pattern vanishing error achieved modified version introduce slightly restrictive criterion accepting rejecting flip chosen random hidden unit check effect sign total output error number wrong bits output field output error increased flip accepted table internal representations changed modified algorithm applicable networks preliminary experiments version presented section learning choice internal representations performance algorithm time parameter measuring performance number sweeps training patterns needed order find solution times pattern presented network cycle algorithm sweeps problem parameter choice ensemble independent runs starting random choice initial weights created general applying learning algorithm problem cases algorithm fails find solution time limit stuck local minimum impossible calculate ensemble average learning times calculate performance measure median number sweeps inverse average rate defined tesauro problem studied system determine number contiguous blocks input equal called denker versus predicate training inputs learning cycles parametrized keeping fixed varied cases data point chir figure median number sweeps needed train network input units exhaustive training solve predicate plotted number hidden units results backpropagation denker work shown meir domany problem symmetry requires inputs solved hidden units presents median number exhaustive training sweeps needed solve problem input size point cases found solution cycles figure median number sweeps needed train networks symmetry parity problem requires number bits input order compare performance algorithm studied parity problem networks architecture chosen tesauro integer version algorithm briefly version algorithm weights thresholds integers increment size thresholds weights unity initial condition chose randomly simulation version input patterns presented sequentially fixed order perceptron learning sweeps results presented table choices parameters mentioned table
success rate algorithm didnt fail find solution maximal number training cycles table results tesauro table note local minima percentage occurrences reported learning choice internal representations testing multiple output version algorithm combined parity symmetry problem network output units connected hidden units output unit performs parity predicate input performs symmetry predicate network architecture results table choice parameters table table parity architecture table parity symmetry architecture discussion presented learning algorithm twolayer perceptrons searches internal representations training determines weights local hebbian perceptron learning rule learning choice internal representation turn situations teacher information desired internal representations demonstrated algorithm works typical problems studied manner training time varies network size comparisons backpropagation made noted training sweep involves computations backpropagation presented generalization algorithm networks multiple outputs found functions problems kind discussed appears modification needed deal multiple outputs enables solve learning problem network architectures hidden layer grossman meir domany point offer limited discussion interesting question algorithm work finds correct internal representations tables constitute small fraction total number main reason procedure search entire space tables large space small subspace target tables obtained choices rule response presentation input patterns small subspace tables potentially produce required output solutions learning problem constitute space algorithm iterates executing walk induced modification weights appealing feature algorithm implemented manner weights thresholds makes analysis behavior network easier exact number bits system constructing solution errors point view hardware implementation feasible work integer weights extending work directions present method learning stage bits memory internal representations training
patterns stored feature biologically limiting developing method require memory directions current study include extensions networks continuous variables networks feedback references denker schwartz solla hopfield howard jackel systems grossman meir domany systems press organization behavior wiley lewis minsky plaut rosenblatt rumelhart tesauro widrow proc threshold logic wiley york perceptrons cambridge nowlan hinton tech report principles neurodynamics york hinton williams nature systems winter computer
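The perceptron learning rule that the paper above applies separately to each layer, once a table of internal representations is fixed, can be sketched for a single threshold unit. This is a minimal illustration, not the authors' code: it assumes +1/-1 coding, integer weights with a unit increment (as the integer version in the text suggests), and a toy AND task; the function names and the sweep limit are hypothetical.

```python
# Perceptron learning rule sketch for one binary threshold unit with +/-1 coding.
# Function names, the AND task and the unit step size are illustrative assumptions.

def unit_output(weights, bias, pattern):
    """Threshold unit: +1 if the field is positive, otherwise -1."""
    field = sum(w * s for w, s in zip(weights, pattern)) + bias
    return 1 if field > 0 else -1

def train_perceptron(patterns, targets, max_sweeps=100, eta=1):
    """Sweep the training set, correcting weights and bias only on errors."""
    weights = [0] * len(patterns[0])
    bias = 0
    for sweep in range(1, max_sweeps + 1):
        errors = 0
        for pattern, target in zip(patterns, targets):
            if unit_output(weights, bias, pattern) != target:
                weights = [w + eta * target * s for w, s in zip(weights, pattern)]
                bias += eta * target
                errors += 1
        if errors == 0:
            return weights, bias, sweep  # every pattern is classified correctly
    return weights, bias, None  # no solution found within the sweep limit

# Linearly separable toy task: logical AND in +/-1 coding.
patterns = [(-1, -1), (-1, 1), (1, -1), (1, 1)]
targets = [-1, -1, -1, 1]
weights, bias, sweeps = train_perceptron(patterns, targets)
print(weights, bias, sweeps)
```

Returning `None` when the sweep limit is reached mirrors the role of the time limit in the text: a failure to converge is taken as a signal that the current guess of internal representations should be changed, a step this sketch does not implement.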
12 oscillatory correlation framework computational auditory scene analysis brown department computer science university street email brown wang department computer information science centre cognitive science ohio state university email abstract neural model oscillatory correlation speech interfering sound sources core model twolayer neural oscillator network sound stream represented synchronized population oscillators streams represented oscillator populations model evaluated corpus speech mixed interfering sounds produces improvement signaltonoise ratio mixture introduction speech heard isolation mixed environmental sounds auditory system parse acoustic mixture reaching ears order retrieve description sound source process termed auditory scene analysis conceptually regarded twostage process stage term segmentation decomposes acoustic stimulus collection sensory elements stage grouping elements environmental event combined perceptual structure called stream streams interpreted higherlevel cognitive processes recently growing interest development computational systems mimic computational auditory scene analysis systems inspired auditory function model closely employ symbolic search highlevel inference performance systems encouraging match abilities human tend complex computationally intensive short remains problem realtime applications automatic speech recognition human concurrent sounds apparent ease computational systems closely modelled neurobiological mechanisms hearing offer performance advantage existing systems observation desire understand neurobiological basis investigators propose neural network models recently brown wang account concurrent vowel separation based oscillatory correlation framework oscillators represent perceptual stream synchronized phase locked phase oscillators represent streams evidence oscillatory correlation theory neurobiological studies report oscillations auditory visual olfactory review brown wang paper propose neural network 
model oscillatory correlation underlying neural mechanism streams formed oscillators twodimensional timefrequency network model evaluated task involves separation timevarying sounds extends previous study considered segregation vowel sounds static spectra model description input model consists mixture speech interfering sound source sampled rate resolution input signal processed stages detailed account peripheral auditory processing peripheral auditory frequency selectivity modelled bank filters center frequencies equally distributed equivalent rectangular bandwidth scale subsequently output filter processed model hair cell function output hair cell model probabilistic representation auditory nerve firing activity auditory representations mechanisms similar underlying pitch perception contribute perceptual separation sounds fundamental frequencies stage model extracts periodicity information simulated auditory nerve firing patterns achieved computing running autocorrelation auditory nerve activity channel forming representation correlogram time step autocorrelation channel time output hair cell model rectangular window width time steps window width autocorrelation computed steps sampling period maximum delay equation computed time frames intervals intervals steps time index periodic sounds characteristic appears correlogram centered stimulus period figure structure emphasized forming pooled correlogram exhibits prominent peak delay perceived pitch extract formants correlogram frequency channels excited acoustic component share similar pattern periodicity bands coherent periodicity identified adjacent correlogram channels regions high correlation harmonic formant crosscorrelation channels time defined autocorrelation function normalized unity variance typical crosscorrelation function shown figure oscillatory correlation neural oscillator network overview segmentation grouping place twolayer oscillator network figure basic unit network single oscillator defined 
connected excitatory variable inhibitory variable layer network takes form timefrequency grid index oscillator frequency channel time frame represents external input oscillator denotes coupling oscillators network parameters amplitude gaussian noise term coupling noise held constant defines relaxation oscillator time scales cubic function sigmoid function intersect point middle branch cubic chosen small case oscillator exhibits stable limit cycle small values referred enabled limit cycle alternates silent active phases steadystate behaviour compared motion phase phases takes place rapidly referred intersect stable fixed point case oscillation occurs oscillations neural oscillator network segment layer layer network segments formed blocks oscillators trace evolution acoustic component time frequency layer twodimensional timefrequency grid oscillators global figure coupling term defined heaviside function connection weight oscillator oscillator nearest neighbors threshold chosen oscillator influence grouping layer segment layer global autocorrelation figure correlogram mixture speech telephone start stimulus pooled correlogram shown bottom panel cross correlation function shown structure twolayer oscillator network brown wang neighbors active phase weight neighboring connections time axis uniformly connection weight oscillator vertical neighbor exceeds threshold weight inhibition global defined oscillator threshold small segments form correspond perceptually significant acoustic components order remove noisy fragments introduce lateral potential oscillator defined threshold called potential neighborhood chosen neighbors active approaches fast time scale slow time scale determined lateral potential plays role gating input oscillator specifically replace initialized drop threshold oscillator receives excitation entire potential neighborhood choice neighborhood implies segment extend consecutive time frames oscillators stimulated maintain high potential background noisy 
activity oscillator stimulated input oscillators stimulated energy correlogram channel exceeds threshold evident energy correlogram channel time corresponds figure shows segmentation mixture speech telephone network simulated legion algorithm producing segments represented distinct gray level background shown black convenience show segments figure arises unique time interval time seconds time seconds figure segments formed layer network mixture speech telephone categorization segments gray pixels represent white pixels represent regions agree oscillatory correlation neural oscillator network grouping layer layer twodimensional network laterally coupled oscillators global inhibition oscillators layer stimulated oscillator layer stimulated form part background initially oscillators phase implying segments layer allocated initialization consistent psychophysical evidence suggesting perceptual fusion default state auditory organisation layer oscillator form changed small positive parameter implies oscillator high lateral potential slightly higher external choose oscillators correspond longest segment layer jump active phase longest segment identified mechanism coupling term consists types coupling represents mutual excitation oscillators segment active oscillators segment occupy half segment active oscillator segment coupling term denotes vertical connections oscillators frequency segments time frame time frame estimated pooled correlogram classify frequency channels categories channels consistent channels figure delay largest peak occurs pooled correlogram channel time frame energy correlogram channel time amounts classification basis energy delay found winnertakeall network simplicity apply maximum selector time seconds time seconds figure snapshot showing activity layer shortly start simulation active oscillators white pixels correspond speech stream snapshot shortly active oscillators correspond telephone stream brown wang classification process operates channels 
segments result channels segment time frame allocated categories segments decomposed enforce rule channels frame segment belong category majority channels step vertical connections formed time frame oscillators segments mutual excitatory links channels belong category mutual inhibitory links receives input inhibitory links similarly receives input vertical excitatory links present model mechanism grouping segments overlap time limit operation layer time span longest segment forming lateral connections trimming longest segment network numerically solved singular limit method figure shows response layer mixture speech telephone figure shows snapshots layer white pixel active oscillator black pixel silent oscillator network quickly forms synchronous blocks figure shows snapshot oscillator block stream segregated speech active phase figure shows subsequent snapshot oscillator block telephone active phase activity layer network embodies result components acoustic mixture separated information represented oscillatory correlation stage model path output divided sections overlapping raised cosine weighting applied section unity oscillator active phase weighted filter outputs summed channels yield waveform type intrusion type figure black grey separation model results shown voiced speech mixed tone random noise noise bursts noise music telephone female speech male speech female speech percentage speech energy recovered mixture separation model oscillatory correlation evaluation model evaluated mixtures speech noise mixtures obtained adding waveforms voiced utterances sounds separate speech noise waveforms signaltonoise ratio computed mixture estimated processing model separated speech noise waveforms path separation model shown figure averaged utterances noise condition dramatic improvements obtained interfering noise tone tend represented single segment segregated effectively speech source informal tests suggest speech good quantified percentage speech energy recovered 
segregation process typically figure discussion significant feature model proposed stage neurobiological foundation peripheral auditory model based filter derived physiological measurement auditory nerve impulse responses similarly auditory representations consistent physiology higher auditory system model based framework oscillatory correlation supported recent neurophysiological findings neural oscillator network performs distributed manner oscillator behaves autonomously parallel oscillators issues realtime implementation model resolved real possibility oscillator network implemented analog vlsi feature attractive high speed compact size analog vlsi needed provide effective frontend automatic speech recognition systems references brown cooke computational auditory scene analysis computer speech language bregman auditory scene analysis cambridge press brown wang modelling perceptual segregation double vowels network neural oscillators neural networks cooke modelling auditory processing organization cambridge cambridge university press computational auditory scene analysis dissertation department electrical engineering computer science wang fast numerical integration relaxation oscillator networks based singular limit solutions ieee transactions neural networks wang global competition local cooperation network neural oscillators physica wang primitive auditory segregation based oscillatory correlation cognitive science wang object selection based oscillatory correlation neural networks wang brown separation speech interfering sounds based oscillatory correlation ieee transactions neural networks wang image segmentation based oscillatory correlation neural computation neural computation
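The periodicity stage of the model above, a running autocorrelation per frequency channel that is summed across channels into a pooled correlogram, can be sketched as follows. This is a minimal illustration under stated assumptions rather than the authors' implementation: it assumes a rectangular window and the form A(t, lag) = sum over k of r(t - k) r(t - k - lag), and the toy signals, window length, lag range, and function names are hypothetical stand-ins for simulated auditory nerve activity.

```python
import math

# Running autocorrelation sketch with a rectangular window.
# The signals, window length, lag range and names are illustrative assumptions.

def running_autocorrelation(r, t, window, max_lag):
    """A(t, lag) = sum_{k=0}^{window-1} r[t-k] * r[t-k-lag] for lag = 0..max_lag."""
    return [
        sum(r[t - k] * r[t - k - lag] for k in range(window))
        for lag in range(max_lag + 1)
    ]

def pooled_correlogram(channels, t, window, max_lag):
    """Sum the per-channel autocorrelations at each lag."""
    per_channel = [running_autocorrelation(r, t, window, max_lag) for r in channels]
    return [sum(values) for values in zip(*per_channel)]

# Two toy "channels": half-wave rectified sinusoids sharing a 10-sample period.
period = 10
n = 200
ch1 = [max(0.0, math.sin(2 * math.pi * i / period)) for i in range(n)]
ch2 = [max(0.0, math.sin(2 * math.pi * i / period + 1.0)) for i in range(n)]

pooled = pooled_correlogram([ch1, ch2], t=n - 1, window=40, max_lag=15)
# Skip lag 0, which always holds the largest value, and find the strongest peak.
pitch_lag = max(range(1, len(pooled)), key=lambda lag: pooled[lag])
print(pitch_lag)
```

Because both toy channels share one period, the pooled function peaks at that period, which is the property the text uses to estimate the dominant pitch in each time frame.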
10 modeling acoustic correlations factor analysis lawrence saul labs research park park abstract hidden markov models hmms automatic speech recognition rely high dimensional feature vectors summarize short time properties speech correlations features arise speech signal nonstationary corrupted noise investigate model correlations factor analysis statistical method dimensionality reduction factor analysis small number parameters model covariance structure high dimensional data parameters estimated expectationmaximization algorithm training procedures hmms evaluate combined mixture densities factor analysis hmms recognize strings holding total number parameters fixed find methods properly combined yield models method introduction hidden markov models hmms automatic speech rely high dimensional feature vectors summarize acoustic properties speech vary recognizer recognizer spectral information frame speech typically feature vector thirty dimensions systems vectors conditionally modeled mixtures gaussian probability density functions pdfs case correlations features represented implicitly mixture components explicitly elements covariance matrix naturally strategies modeling correlations implicit versus tradeoffs accuracy speed memory paper examines tradeoffs statistical method factor analysis saul present work motivated observation based recognizers include explicit modeling correlations hidden states acoustic features modeled mixtures gaussian pdfs diagonal covariance matrices reasons practice full covariance matrices imposes heavy computational making difficult achieve realtime rarely data reliably estimate full covariance matrices overcome sharing covariance matrices states models drawbacks considerably training procedure requires states tied unconstrained diagonal covariance matrices represent extreme choices hidden markov modeling speech statistical method factor represents compromise extremes idea factor analysis systematic variations data lower dimensional
subspace enables represent compact covariance high dimensional data matrices expressed terms small number parameters model significant correlations ring overhead time memory maximum likelihood estimates parameters obtained expectationmaximization algorithm embedded training procedures hmms paper investigate factor analysis continuous density hmms applying factor analysis state mixture component results powerful form dimensionality reduction tailored local properties speech briefly organization paper section review method factor analysis describe makes attractive large problems speech recognition section report experiments speaker independent recognition connected finally section present conclusions ideas future research factor analysis factor analysis linear method dimensionality reduction gaussian random forms dimensionality reduction including imple mented neural networks understood variants factor analysis close ties methods based principal components analysis notion tangent combined mixture densities factor nonlinear form dimensionality applied hinton modeling handwritten digits procedure mixtures factor analyzers subsequently derived describe method factor analysis gaussian random variables show applied hidden markov modeling speech gaussian model denote high dimensional gaussian random variable simplicity assume number dimensions large prohibitively expensive estimate store multiply invert full covariance matrix idea factor analysis find subspace lower dimension captures variations denote dimensional gaussian random variable modeling acoustic correlations factor analysis identity covariance matrix imagine variable generated random process latent hidden variable elements factors denote arbitrary matrix denote diagonal matrix imagine generated sampling computing ddimensional vector adding independent gaussian noise variances component vector matrix factor loading matrix relation captured conditional distribution found integrating hidden variable denotes average 
respect posterior distribution mstep algorithm maximize hand marginal distribution calculation straightforward gaussian distributed covariance matrix diagonal elements small variation occurs subspace spanned columns variances measure typical size subspace covariance matrices form number importantly expressed terms small number parameters nonzero elements storing requires memory storing full covariance matrix likewise estimating requires data estimating full covariance matrix covariance matrices form efficiently inverted matrix inversion identity matrix decomposition probability multiplies opposed multiplies required covariance matrix maximum likelihood estimates parameters obtained denote sample data points procedure iterative procedure maximizing loglikelihood estep procedure compute hand side depends saul side respect leads iterative number data points constrained purely updates guaranteed converge monotonically possibly local maximum loglikelihood hidden markov modeling speech continuous density feature vectors conditioned hidden states modeled mixtures gaussian pdfs dimensionality feature space large make parameterization mixture component obtains means variances factor loading matrix amount total parameters mixture model number mixture components number factors dimensionality feature space note models capture feature correlations ways implicitly mixture components explicitly factors intuitively expects mixture components model discrete types variability speaker male female factors model continuous types variability noise types variability important building accurate models speech straightforward integrate algorithm factor analysis training hmms suppose represents sequence acoustic vectors forwardbackward procedure enables compute posterior probability state mixture component time updates matrices state mixture component essentially form observation weighted posterior probability additionally account mixture components nonzero complete derivation updates 
additional details longer version paper important consideration applying factor analysis speech choice acoustic features standard choice dimensional feature vector consists twelve coefficients derivatives normalized derivatives features types coefficients correlations motivated factor analysis worth emphasizing method applies arbitrary feature tors features summarize properties speech expects correlations arise background noise speaker experiments continuous density hmms diagonal factored covariance matrices trained recognize strings highly modeling acoustic correlations factor analysis parameters parameters figure plots loglikelihood scores word error rates test versus number parameters mixture model divided number features stars models diagonal covariance matrices circles models factor analysis dashed lines connect recognizers table letters make challenging problem speech recognition training test data recorded telephone network consisted utterances recognizers built lefttoright hmms trained maximum likelihood estimation modeled contextdependent unit testing free grammar network grammar constraints experiments varying number mixture components number factors goal determine model acoustic feature correlations table summarizes results experiments columns left show number mixture components number factors number parameters mixture model divided feature dimension word error rates including insertion deletion errors test average loglikelihood frame speech test time recognize twenty test utterances surprisingly word accuracies likelihood scores increase number modeling parameters likewise times interesting comparisons models number mixture components factors versus mixture components factors left graph figure shows plot average loglikelihood versus number parameters mixture model stars circles plot models diagonal covariance matrices sees plot fixed number parameters models factored covariance matrices tend higher likelihoods graph figure shows similar plot word error 
rates versus number parameters difference hmms poor models speech begin higher likelihoods necessarily translate lower error rates return point worth noting experiments fixed number factors mixture component fact variability speech highly context dependent makes sense vary number factors states simple heuristic adjust number factors depending amount training data state determined initial segmentation training utterances found heuristic pronounced word error loglikelihood time table results recognizers columns number mixture components number factors number parameters mixture model divided number features word error rates average likelihood scores test time recognize twenty utterances word error loglikelihood time table results recognizers variable numbers factors denotes average number factors mixture component differences likelihood scores error rates substantial improvements observed recognizers hmms employed average factors mixture component dashed lines figure table results reader notice recognizers extremely competitive aspects performance accuracy memory baseline factor models table discussion paper studied combined mixture densities factor analysis speech recognition framework hidden markov modeling acoustic features conditionally modeled mixtures gaussian pdfs shown mixture densities factor analysis means modeling acoustic correlations lead smaller faster accurate recognizers method compare lines tables issues investigation increases likelihood scores correspond reductions error rates common occurrence automatic speech recognition investigating discriminative training hmms factor analysis idea optimize objective function directly relates goal minimizing classification errors important extend results large vocabulary tasks speech recognition extreme sparseness data tasks makes factor analysis appealing strategy dimensionality reduction finally questions limited number parameters allocate factors mixture components
cepstral features hmms throw informative correlations speech signal correlations modeled factor analysis answers questions lead improvements performance acknowledgement grateful labs ghahramani university toronto bell labs discussions labs providing initial segmentation training utterances references rabiner juang speech recognition englewood cliffs prentice hall importance cepstral parameter correlations speech recognition computer speech language tied mixture continuous parameter modeling speech recognition ieee transactions acoustics speech signal processing rubin algorithms factor analysis introduction latent models london chapman hall hinton dayan revow modeling manifolds images handwritten digits ieee transactions neural networks ghahramani hinton algorithm mixtures factor analyzers university toronto technical report simard lecun denker efficient pattern recognition transformation distance cowan hanson giles advances neural information processing systems cambridge press press teukolsky numerical recipes scientific computing cambridge cambridge university press brown mercer maximum mutual information estimation hidden markov model parameters speech recognition proceedings icassp
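The factor analysis section above states that a covariance matrix of the form (diagonal matrix plus loading matrix times its transpose) can be stored with few parameters and inverted cheaply through a matrix inversion identity. The sketch below illustrates that claim for the simplest case of a single factor, where the identity reduces to the Sherman-Morrison formula. All function names are hypothetical, and the restriction to one factor is an assumption made to keep the example in plain Python; the paper itself allows several factors. The parameter count per mixture model is derived here from the components the text lists (means, variances, factor loading matrix) and should be read as an interpretation of the garbled passage.

```python
def fa_covariance(psi, lam):
    """Build Sigma = diag(psi) + lam lam^T for a single factor (illustrative)."""
    d = len(psi)
    return [[lam[i] * lam[j] + (psi[i] if i == j else 0.0) for j in range(d)]
            for i in range(d)]


def fa_inverse_one_factor(psi, lam):
    """Invert diag(psi) + lam lam^T with the Sherman-Morrison identity.

    Only the diagonal matrix is inverted directly (elementwise), so no
    general D x D matrix inversion is needed.
    """
    d = len(psi)
    u = [lam[i] / psi[i] for i in range(d)]              # Psi^{-1} lam
    denom = 1.0 + sum(lam[i] * u[i] for i in range(d))   # 1 + lam^T Psi^{-1} lam
    return [[(1.0 / psi[i] if i == j else 0.0) - u[i] * u[j] / denom
             for j in range(d)] for i in range(d)]


def matmul(a, b):
    """Plain-Python matrix product, used only to check the inverse."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def params_per_mixture(num_components, num_factors, dim):
    """Means (dim) + variances (dim) + loadings (dim * num_factors), per component."""
    return num_components * (num_factors + 2) * dim


psi = [0.5, 1.0, 2.0]     # diagonal noise variances (made-up values)
lam = [1.0, -0.5, 0.25]   # single column of the loading matrix (made-up values)
product = matmul(fa_covariance(psi, lam), fa_inverse_one_factor(psi, lam))
```

Here `product` should be the identity matrix up to rounding, which confirms the inverse without ever forming a general matrix inverse. Dividing `params_per_mixture` by the feature dimension gives the quantity the results tables report as the number of parameters per mixture model divided by the number of features.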
11 precise characterization class languages recognized neural nets gaussian common noise distributions maass inst theoretical computer science technische email sontag mathematics rutgers university email sontag abstract recurrent analog neural nets gate subject gaussian noise common noise distribution probabil density function nonzero large show regular languages recognized networks type language begins give precise characterization languages recognized result implies severe constraints possibilities constructing recurrent neural nets robust realistic types analog noise hand present method analog neural nets robust regard analog noise type introduction fairly large literature giles references devoted construction analog neural nets recognize regular languages physical realization analog computational units analog neural biological systems bound encounter form analog noise analog computational units show article effect quences computational power recurrent analog neural nets show analog neural computational units subject gaussian common noise distributions recognize arbitrary regular languages analog neural recognize regular language begins partially supported project maass sontag precise characterization regular languages recognized analog neural nets theorem section introduce simple technique making feedforward neural nets robust regard types analog noise method employed prove positive part theorem main difficulty proving theorem negative part adequate theoretical tools introduced section give exact statement theorem discuss related preceding work give precise definition computations noisy neural networks conceptual point view definition basically computations noisy boolean circuits involved deal infinite state space illustrate definition concrete case recurrent sigmoidal neural gaussian noise full generality result makes applicable large class types analog computational systems analog noise recurrent sigmoidal neural consisting units receives time step input 
finite alphabet internal state step vector consists outputs sigmoidal units step computation step network represent weight matrix vectors sigmoidal activation function applied vector component sequence drawn independently gaussian distribu tion analogy case noisy boolean circuits network recognizes language reliability constant immediately reading arbitrary word network probability accepting state case probability accepting state case show article parameters gaussian noise distribution sigmoidal unit determined designer neural impossible find size weight matrix vectors reliability resulting recurrent sigmoidal neural gaussian noise accepts simple regular language begins reliability result exhibits fundamental limitation making recurrent analog neural noise robust case noise distribution benign type negative result large number techniques making feedforward boolean circuit robust noise negative result turns general nature holds virtually related definitions noisy analog neural nets completely models analog computation presence gaussian similar noise state compact arbitrary compact measurable fixed sigmoidal activation function gaussian distributed noise vector suffices assume arbitrary measurable function random variable density wide support order define computation system definition network reading accepting state probability strictly recognize language precisely assume exists subset constant analog neural nets gaussian noise stochastic kernel defined prob signed measure signed measure defined note probability measure sequence inputs composition evolution operators probability distribution states instant measure distribution states single computation step input computation steps inputs distribution notation system initial state distribution states computation steps probability measure concentrated measurable subset input initial state accepting final states reliability level resulting noisy analog computational system recognizes language general neural network 
simulates carry fixed number computation steps transitions form input symbol constructions giles section article easily reflected model formally replacing input sequence sequence blank denotes sequence copies arbitrarily fixed completes definition language recognition noisy analog compu tational system discrete time definition essentially agrees maass employ common notations formal language theory write concatenation strings strings finite number strings strings main result article theorem assume arbitrary finite alphabet language recognized noisy analog computational system previously type finite subsets version theorem discrete computational systems previously shown precisely shown probabilistic automata strictly positive matrices recognize class languages occur theorem referred languages definite languages language recognition analog computational systems analog noise previously casey special case bounded noise perfect reliability properties hold consisting differences finite nonzero measure maass small terminology maass general case shown maass system recognize regular languages shown small regular languages recog systems present paper focus complementary case condition small satisfied analog noise move states larger distances state space show probability event arbitrarily small neural longer recognize arbitrary regular languages constraint language recognition prove section result arbitrary noisy computational systems section theorem assume arbitrary alphabet language subsets integer words string belongs language decided symbols general fact stochastic kernels measure space stochastic kernel special case signed signed measure defined observe probability measure arbitrary satisfies condition constant probability measure necessarily special case condition denote total variation signed recall decompose disjoint union sets manner nonnegative letting restrictions difference nonnegative measures disjoint supports lemma fact lemma assume satisfies condition constant 
signed measure proof theorem lemma constant satisfies condition constant introduce probability measure probability distribution measurable function measurable analog neural nets gaussian noise pick measurable prob conclude finally extend measure assigning measure complement measurable subsets pick show satisfies condition constant comparison measure definition measurable required probability measures applying lemma recursively conclude words length pick integer equation length probability measures means measurable probability measures measurable lemma pick assume applying inequality implies argument similar proved included completes proof theorem maass sontag construction noise robust analog neural nets section exhibit method making feedforward analog neural nets robust regard arbitrary analog noise type considered preceding sections method prove corollary missing positive part claim main result theorem article theorem threshold circuit arbitrary function assume arbitrary parameters transform analog noise type considered section noiseless threshold circuit analog neural number gates gates employ function activation function circuit input output noisy analog neural differs probability output idea proof maximal fanin gate maximal absolute weight choose large density function noise vector satisfies gate inputs choose large finally choose factor large analog neural results multiplication weights thresholds replacement heaviside activation functions gates activation function corollary proof positive part main result theorem holds considered theorem corollary assume arbitrary finite alphabet language form arbitrary finite subsets language recognized noisy analog neural desired reliability spite arbitrary analog noise type considered section proof construct feedforward threshold circuit recognizing receives input symbol form fixed encoded binary states input units boolean circuit tapped delay line fixed length easily implemented feedforward threshold circuit layers 
consisting gates compute identity function single binary input preceding layer achieve feedforward circuit computes boolean function sequences presented circuit hand language form finite exists decide characters feedforward threshold circuit tapped delay line type decide apply theorem circuit define accepting states resulting analog neural states computation completed output gate assumes theorem analog neural recognizes reliability formally precise apply theorem threshold circuit receives analog neural nets gaussian noise input single batch sequence proof theorem readily extends case conclusions exhibited fundamental limitation analog neural nets gaussian common noise distributions probability density function nonzero large accept simple regular language begins holds designer neural allowed choose parameters gaussian noise distribution architecture parameters neural proof result introduces mathematical arguments investigation neural computation applied stochastic analog computational systems presented method analog neural nets robust type noise implies regular languages ends recognized recurrent analog neural gaussian noise combination negative result yields precise regular languages recognized recurrent analog neural nets gaussian noise noise distribution large support references casey casey dynamics discretetime computation application recurrent neural networks finite state machine extraction neural computation types bull math maass maass effect analog noise discretetime analog computations advances neural information processing tems journal version neural computation giles giles constructing deterministic finitestate automata recurrent neural networks assoc comput mach asymptotic analysis stochastic equations studies probability theory studies mathematics edited rosenblatt math assoc america networks noisy gates ieee computer science ieee press york complexity measures networks unreliable gates developments synthesis reliable organisms unreliable components proc 
pure mathematics probabilistic automata information control
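The main theorem of the paper above characterizes the recognizable languages as those formed from two finite sets of strings: words that lie in the first set, or words that end with a string from the second set. Membership is therefore decided by a bounded number of trailing symbols, which is what the tapped delay line in the positive construction exploits. The following sketch imitates that idea in plain Python. It is not the threshold circuit of the paper; the names `make_recognizer`, `e1`, and `e2` are hypothetical, and the saturating length counter stands in for the padding device the paper uses to tell short words from long ones.

```python
def make_recognizer(e1, e2):
    """Return accepts(word) for the language E1 union (Sigma* E2).

    e1, e2: finite sets of strings (illustrative stand-ins for E1 and E2).
    The recognizer keeps only the last k symbols, like a tapped delay line.
    """
    k = max((len(w) for w in set(e1) | set(e2)), default=0)

    def accepts(word):
        window = ""   # contents of the delay line (at most k symbols)
        seen = 0      # word length, saturating at k + 1
        for symbol in word:
            window = (window + symbol)[-k:] if k > 0 else ""
            seen = min(seen + 1, k + 1)
        # Exact membership in E1 is only possible while the whole word
        # still fits in the window.
        in_e1 = seen <= k and window in e1
        # Membership in Sigma* E2 depends only on a suffix of the window.
        in_e2_suffix = any(window.endswith(s) for s in e2)
        return in_e1 or in_e2_suffix

    return accepts


accepts = make_recognizer({"a"}, {"ab"})
results = [accepts(w) for w in ["a", "ab", "bbab", "ba", "", "aa"]]
```

A language such as "all words that begin with a given symbol" cannot be handled this way, because the deciding symbol eventually leaves any fixed-length window. That matches the negative part of the result as far as it can be read from the text above.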
5 improving convergence hierarchical matching networks object recognition utans gene gindi department electrical engineering yale university yale station haven abstract interested analog neural networks recog visual objects objects parts composed structural relationship struc models stored database recognition prob reduces matching data models structurally object recognition problem general involves coupled problems grouping segmentation matching limit problem simultaneous parts single object determination analog parameters coupled problem reduces weighted match problem optimizing neural network binary match variables data parts model parts weights dependent parameters work show solving estimates solving obtain good initial parameter estimates yield solutions current address international computer science institute center street suite berkeley current address department electrical engineering utans gindi figure stored model compositional hierarchy compare figure recognition stochastic forward models frameville object recognition system introduced mjolsness makes compositional hierarchy represent stored models recognition problem formulated minimization objective function mjolsness proposed derive objective function describing recognition problem principled stochastic model describes objects system designed recognize stochastic visual grammar description data representation compositional hierarchy stage description object detailed parts added stochastic model assigns probability distribution stage process level hierarchy detailed description parts terms probability distribution coordinates explicitly distributions finer control individual part descriptions general parameter error terms goal derive joint probability distribution instance object parts appears scene probability observing object prior arrival data observed image recognition problem stated bayesian inference problem neural network solves stochastic model model shown figure object parts represented line 
segments parameters denoting position length stick orientation model considers rigid translation object image model stored central position chosen uniform density parts level structural relationships stored coordinates objectcentered coordinate frame relative placing parts gaussian distributed noise added position coordinates capture notion natural variation objects shape variance coordinate specific assume distribution coordinates variance length improving convergence hierarchical matching networks object recognition component relative angle addition assume simplicity parts independently distributed composed parts simplicity notation assume composed number note index figure corresponds double track belongs model side denotes step models parts image permutation matrix chosen probability identity lost step omitted recognition problem reduce problem estimating part parameters parts labeled grammar compute final joint probability distribution constant terms collected constant frameville architecture part labelling single object stochastic forward model part labelling problem single object present scene translates reduced frameville architecture depicted figure compositional hierarchy steps stochastic model parts added level match variables lowest level permutation step grammar parts image matched model parts parts found belong stored object grouped single match neuron highest level unity assume objects identity single object present similarly terms level unity correct grouping grouping point forward model description addition intermediate level loss generality frames matched ahead time parameters computed data introducing part permutation intermediate levels redundant additional simplification grouping variables lowest level parts lowest level expressed terms part match explicitly representing variables input system recognition involves finding parameters utans gindi model data figure frameville architecture stochastic model grammar leads reduced frameville style 
network architecture single model stored model side instance model present input data model side represent object parts compare figure arcs represent structural relationship data side triangles represent parameter vectors frames describing instance object scene lowest level represent input data parameters higher levels hierarchy computed network represented bold triangles represents grouping parts data side text horizontal lines represent assignments frames data side nodes model side intermediate level frames parts model side match variables lowest level represented bold lines circles labelling parts bayes theorem recognition reduces finding probable values data solving inference problem involves finding estimate equivalent minimizing exponent equation respect bootstrap coarse scale hints initialize network compositional hierarchy scale space labelling approaches found vision literature object labelled coarse resolution level approximate parameters found topdown approach information higher abstract levels improving convergence hierarchical matching networks object recognition spatial scale lower abstraction figure compositional hierarchy scale space hierarchy compositional hierarchy represent scale space hierarchy successive levels hierarchy detail added object select initial values parts lower level abstraction segmentation labelling lowest level strongly influenced results level fact general terms scheme marr essence hierarchical model base shape matched highest levels terms relative objectbased parameters parts level recalled memory serve initial values unspecified segmentation algorithm derives part parameters step repeated recursively lowest level reached note highest level abstractions correspond levels spatial scale design model base demands elements compositional hierarchy scale include summarize inclusion parameters figure illustrates correspondence representations sense compositional hierarchy applied shapes includes notion scale operation blurring data 
notion scale space utilized differs application method lowlevel computations visual domain auxiliary coarse scale representations computed explicitly object represen tations frameville system earlier combines bottomup topdown elements topdown aspects scheme marr incorporated frameville architecture simulation results suggest performance expected neural network problems addressed obtain observed data coarse estimate slot parameters highest level crude estimates utilize recall default settings segmentation level utans gindi model bootstrap figure bootstrap computation network grammar analog frame variables intermediate level initialized data bootstrap computation bold lines flow information initialization coarse scale parameters propose convergence initial values analog variables computed data making labelling general solve analog parameters knowledge correct permutation matrix purpose obtaining approximation derive objective function depend parameters integrating summing permutation matrices permutation formulation leads elastic type network imple mentation separate network bootstrap computations expensive simpler computation coarse scale parameters estimated computing sample averages finding solution elastic high temperature limit position find integrating similarly assignment data side model side point term equations evaluated approximating actual variance improving convergence hierarchical matching networks object recognition average variance equations reduce terms objective function translates assuming error terms parts weighted equally weights depend actual part match corresponds identity parts approximation assumes variances differ large amount approximation close true values model designed part primitives lowest level grammar highly specialized case abstractions higher levels model approximation proved sufficient problems studied neural network perform calculation elastic assigns approximately equal weights assignments high temperatures behavior expressed 
original network match variables choosing leads bootstrap computation specific choice analog variables updated compute coarse scale estimates network constant neural network implementation computing equation converged compute parameters intermediate levels hypothesized coarse scale estimate adding transformation recall intermediate levels part identity preserved permutation steps takes place network random values match variables compute correct labelling correct parameters simulation results bootstrap procedure implemented hierarchical model model describes shown figure incorrect solutions observed vast majority cases violate permutation matrix constraint assignment unique assignment unique parts assigned correctly commonly identity neighboring parts cases large variance advantage bootstrap initialization clear figure simulation noise variance identical parts work computed solution reliably large noise variances cases performance network initialization rapidly experiments graph simulations performed network initialization consistently outperformed network figure shows time measured number iterations network converge unaffected increase noise variance initial values derived data close final solution cases random starting point close correct solution network initialization converges rapidly figure reflect typical behavior demonstrate advantage computing approximate initial values utans gindi success rate convergence speed figure comparing network initialization solid line left success rate rate network converged correct solutions denotes noise variance intermediate level model noise variance lowest level experiments graph simulations performed network initialization consistently outperformed network initialization graph shows average time takes network converge measured number iterations averaged experiments simulations network converged correct solution compute average time convergence stopping criterion required match neurons assume values error bars denote 
standard deviation acknowledgements work supported part afosr grant afosr mjolsness rangarajan helpful discussions references gindi mjolsness anandan neural networks model based recognition neural networks concepts applications implementations pages prentice hall david vision freeman york mjolsness bayesian inference visual grammars neural nets optimize technical report yale university dept computer science visual grammars neural nets lippmann moody editor advances neural information processing systems morgan kaufmann publishers mateo eric mjolsness gene gindi anandan optimization model matching perceptual organization research report yale university department computer science eric mjolsness gene gindi anandan optimization model matching organization neural computation utans neural networks object recognition compositional thesis department electrical engineering university haven utans gene gindi eric mjolsness anandan neural networks object recognition compositional hierarchies initial experiments report yale university center systems science department electrical engineering yuille generalized deformable models statistical physics matching problems neural computation
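The bootstrap procedure described in the paper above estimates coarse scale parameters from sample averages, before the correct assignment of data parts to model parts is known. A minimal sketch of that idea follows, under strong simplifying assumptions that are not taken from the paper: a two-level model only (one object, its parts), translation as the only transformation, equal weighting of all parts, and as many data parts as model parts. The function name `bootstrap_translation` and the sample coordinates are hypothetical.

```python
def bootstrap_translation(data_positions, model_offsets):
    """Estimate the object position without knowing the part labelling.

    data_positions: (x, y) positions of parts observed in the image.
    model_offsets:  (x, y) part positions stored in the object-centered frame.
    Because both averages are invariant to the order of the parts, the
    estimate does not depend on the unknown permutation.
    """
    if len(data_positions) != len(model_offsets) or not data_positions:
        raise ValueError("expected the same nonzero number of data and model parts")
    n = len(data_positions)
    mean_data_x = sum(x for x, _ in data_positions) / n
    mean_data_y = sum(y for _, y in data_positions) / n
    mean_model_x = sum(x for x, _ in model_offsets) / n
    mean_model_y = sum(y for _, y in model_offsets) / n
    return (mean_data_x - mean_model_x, mean_data_y - mean_model_y)


# Stored part offsets and the same parts shifted by (5, 3), in shuffled order.
model_offsets = [(1.0, 0.0), (-1.0, 0.0), (0.0, 2.0)]
data_positions = [(5.0, 5.0), (6.0, 3.0), (4.0, 3.0)]
estimate = bootstrap_translation(data_positions, model_offsets)
```

In this noise-free example `estimate` recovers the shift of (5, 3) up to rounding. With noisy positions it would only be an approximation, which is the role the text assigns to the bootstrap values: a starting point close enough to the solution for the matching network to converge from.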
12 independent factor analysis temporally structured sources attias gatsby unit university college london queen square london abstract present technique time series analysis based namic probabilistic networks approach observed data modeled terms unobserved mutually independent factors recently introduced technique independent factor anal ysis unlike factors factor temporal statistical characteristics derive family algorithms learn structure underlying factors relation data algorithms perform source separation noise reduction integrated manner demonstrate superior performance compared introduction technique independent factor analysis introduced tool modeling data terms unobserved factors factors mutually independent combine linearly added noise produce observed data mathematically model defined vector factor activities time data vector mixing matrix noise applied statistics hand signal processing hand statistics ordinary factor analysis gaussian factors contrast factor arbitrary distribution modeled mixture gaussians parameters mixing matrix noise covariance matrix learned observed data expectationmaximization algorithm derived signal processing independent component analysis method blind source separation factors termed sources task blind source separation recover observed data knowledge mixing process sources nongaussian distributions unlike distributions fixed prior knowledge limited significant restrictions dynamic independent factor analysis number data dimensionality square mixing matrix assumed invertible data assumed noisefree contrast including sources sensors nonzero noise unknown covariance addition flexible model proves crucial achieving successful separation generalizes model learned classification fitting model class missing data context blind separation optimal reconstruction sources data obtained estimator suffer temporal information attempt model temporal statistics data square noisefree mixing words model learned affected time indices modeling 
data time series facilitate filtering forecasting accurate classification source separation applications learning temporal statistics provide additional information sources leading source reconstructions problem blind separation noisy data terms components source separation noise reduction approach twostage procedure perform noise reduction wiener filtering perform source separation cleaned data algorithm notice procedure directly exploits temporal secondorder statistics data stage achieve stronger noise alternative approach exploit temporal structure data indirectly temporal source model resulting algorithm operations source separation noise reduction coupled approach present paper present approach independent factor problem based dynamic probabilistic networks order capture temporal statistical properties observed data describe source hidden markov model resulting dynamic model describes multivariate time series terms independent sources temporal characteristics section presents learning algorithm case section presents algorithm case isotropic noise case noise turns computationally intractable section approximate algorithm based variational approach notation multivariable gaussian density denoted time blocks denoted coordinate function denotes averaging ensemble blocks noise source model employed advantages capable approximating arbitrary densities learned efficiently data gaussians correspond hidden states sources labeled assume time source state signal generated order sampling gaussian distribution variance capture temporal statistics data sources temporal structure introducing transition matrix states focusing attias time block resulting probabilistic model defined joint density sources time points equation unmixing matrix usual noisefree scenario section assuming mixing matrix square invertible graphical model observed density defined parametrized model describes source firstorder reduces model temporal structure means autoregressive model advantageous models 
highorder poral statistics facilitates learning omitting derivation maximization respect results incremental update rule natural gradient appropriately chosen learning rate source parameters obtain update rules standard initial probabilities updated notation posterior densities computed estep source terms data forwardbackward procedure algorithm generalized schemes efficient procedure source parameters learn separating matrix learn source parameters back repeat notice rule similar natural gradient version bell rule fact coincide sources recognize baumwelch method phase algorithm separates sources generalized rule phase learns source remark model time series terms smaller number factors framework noisefree model achieved applying algorithm largest principal components data notice data generated factors remaining principal components vanish equivalently apply algorithm data directly unmixing matrix results figure demonstrates performance method mixture speech signals passed nonlinear function distributions mixture source model actual source densities discussion applied dynamic network mixture speech signals distributions dynamic independent factor analysis figure left source distributions middle outputs algo rithm independent outputs correlated made gaussian nonlinear transformation temporal information crucial separation case mixture separable algorithm accomplished separation successfully isotropic noise turn case nonzero noise assume noise white zeromean gaussian distribution covariance matrix general case computationally intractable section reason step requires computing posterior distribution source states case source signals posterior complicated structure show assume isotropic noise square invertible mixing posterior simplifies considerably making learning inference tractable adapting idea suggested dynamic probabilistic network start preprocessing data linear transformation makes covariance matrix unity denotes averaging time blocks diagonal covariance matrix 
sources square invertible implies diagonal fact unobserved sources determined scaling factor variance source unity obtain property shown source posterior product individual sources means variances time quantities depend data states expression omitted transition probabilities posterior distribution effectively defines source emission transition probabilities derive learning rule compute conditional source signals time data recursively forwardbackward procedure obtain attias fractional form results imposing orthogonality constraint lagrange multipliers computed procedure source parameters computed learning rule omitted similar noisefree rule easy derive learning rule noise level fact ordinary rule suffice point algorithm derived case perfectly defined noise general case noise mixing computationally intractable exact estep requires summing source configurations times problem stems fact sources independent sources conditioned data vector correlated resulting large number hidden configurations problem arise noisefree case avoided case isotropic noise square mixing orthogonality property cases exact posterior sources algorithm derived based variational approach approach introduced context sigmoid belief networks constitutes general framework learning intractable probabilistic networks context idea approximate tractable posterior place lower bound likelihood optimize parameters maximizing bound starting point deriving bound neal formulation algorithm denotes averaging respect arbitrary posterior density hidden variables observed data exact shown obtained maximizing bound respect posterior estep model parameters step resulting true intractable posterior contrast variational choose differs true posterior facilitates tractable estep estep parametrized variational transition probabilities multiplying parameters subject normalization constraints original source signals time jointly gaussian covariance means covariances transition probabilities time data dependent scheme 
motivated form posterior notice quantities variational parameters related scheme context parameters adapted independently model parameters algorithm expected give superior results compared isotropic dynamic independent factor analysis mixing reconstruction quality source figure left quality model parameter estimates reconstructions text true posterior correlated temporally approximate variational parameters optimized maximize bound equivalently minimize distance true posterior requirement leads fixed point equations ensure factors quantities computed forwardbackward procedure variational transition probabilities variational param eters determined solving iteratively block practice found iterations required convergence mstep update rules mixing parameters source parameters computed variational transition probabilities notice learning rules source parameters baumwelch form spite correlations conditioned sources variational approach correlations hidden fact fixed point equations couple parameters time points depends sources source reconstruction observe source estimate depends results algorithm demonstrated source separation task speech signals transformed nonlinearities arbitrary densities mixed random matrix signal levels error estimated left solid line quantified size elements relative attias diagonal results obtained temporal information plotted reference dotted line squared error reconstructed sources solid line result dashed line shown estimate reconstruction errors algorithm smaller reflecting advantage exploiting temporal structure data additional experiments numbers sources sensors gave similar results notice algorithm unlike previous considered situations number sensors smaller number sources separation quality good expected opposite case conclusion important issue addressed model selection algorithms arbitrary dataset number factors states factor determined crossvalidation required computational effort fairly large recent paper develop framework bayesian 
model selection model averaging probabilistic networks framework termed variational bayes proposes algorithm approximates full posterior distributions hidden variables parameters model structure predictive quantities analytical manner applied algorithms presented good preliminary results field approach find important applications speech suggests building signal models based combining independent lowdimensional hmms fitting single complex contribute improving recognition performance noisy multispeaker conditions characterize realworld auditory scenes references attias independent factor analysis comp bell sejnowski approach blind separation blind comp amari cichocki yang learning algorithm blind signal separation info touretzky press cambridge pearlmutter parra maximum likelihood blind source separation contextsensitive generalization info proc mozer press cambridge fast fixedpoint algorithm independent component analysis comp attias schreiner blind source separation dynamic component analysis algorithm comp rabiner juang speech recognition prentice hall englewood cliffs sompolinsky unpublished personal communication saul jaakkola jordan field theory sigmoid belief networks ghahramani jordan factorial hidden markov models mach learn neal hinton view algorithm justifies incremental sparse variants learning graphical models jordan kluwer academic press attias variational bayesian framework graphical models info proc leen press cambridge
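The generative side of the dynamic independent factor analysis model described above can be sketched in a few lines: each source is a hidden Markov model whose states emit Gaussian samples, and the observed data are a linear mixture of the sources plus white Gaussian noise. This is only an illustrative sketch of the sampling process, not the learning algorithm; the function names, argument names, and the use of a single scalar noise level are assumptions made for the example.

```python
import numpy as np

def sample_hmm_source(T, trans, means, variances, init, rng):
    """Draw one source signal of length T from an HMM with Gaussian emissions."""
    states = np.empty(T, dtype=int)
    states[0] = rng.choice(len(init), p=init)
    for t in range(1, T):
        # The next state depends only on the current one (first-order chain).
        states[t] = rng.choice(len(init), p=trans[states[t - 1]])
    # Each hidden state selects the mean and variance of a Gaussian.
    return rng.normal(means[states], np.sqrt(variances[states]))

def sample_dynamic_ifa(T, mixing, noise_std, source_params, rng):
    """Mix independent HMM sources linearly and add isotropic Gaussian noise.

    mixing has shape (n_sensors, n_sources); source_params is a list with one
    dict of HMM parameters per source (hypothetical layout for this sketch).
    """
    sources = np.stack(
        [sample_hmm_source(T, rng=rng, **p) for p in source_params]
    )
    noise = noise_std * rng.standard_normal((mixing.shape[0], T))
    return mixing @ sources + noise, sources
```

Setting `noise_std` to zero with a square, invertible `mixing` matrix gives the noise-free case; a non-square `mixing` matrix covers the situation where the number of sensors differs from the number of sources.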
8 lyapunov functions competitive neural networks institute technical technical university germany abstract dynamics complex neural networks modelling organization process cortical maps include aspects long shortterm memory behaviour network characterized equation neural activity fast equation synaptic modification slow part neural system present lyapunov function flow competitive neural system fast slow dynamic variables show consequences stability analysis neural parameters introduction paper investigates special class laterally inhibited neural networks examined dynamics restricted class laterally inhibited neural networks rigorous analytic standpoint network models retinotopic cortical maps posed layers neurons sensory receptors cortical units feedforward layers lateral recurrent connection layer standard techniques include hebbian rule variations modifying synaptic efficacies lateral inhibition establishing organization cortex approximation namics relaxation fast time scale dynamics learning slow time scale network cases puter simulation results obtained provided limited mathematical understanding neural response fields networks study model dynamics neural activity levels shortterm memory dynamics synaptic modifications longterm memory actual network models consideration considered extensions shunting network model primitive neuronal competition earlier networks considered pools mutually inhibitory neurons fixed synaptic connections results extended earlier studies systems synapses modified external stimuli dynamics competitive systems extremely complex exhibiting convergence point attractors periodic attractors networks model dynamic neural activity levels cohen grossberg found lyapunov function condition convergence behavior point attractors paper apply results theory lyapunov functions perturbed systems largescale neural networks types state variables describing slow fast dynamics system find lyapunov function neural system give design concept storing 
desired pattern stable equilibrium points class neural networks section defines network differential equations characterizing laterally inhibited neural networks laterally inhibited network deter signal hebbian learning similar spatiotemporal system amari general neural network equations describe temporal evolution activity modification states synaptic modification neuron network equations current activity level time constant neuron contribution external stimulus term neurons output inhibition term external stimulus dynamic variable represents synaptic modification state defined assume input stimuli normalized vectors unit magnitude systems subject analysis considerations stability equilibrium points asymptotic stability neural networks show section determine asymptotic stability class neural networks interpreting nonlinear perturbed systems singular perturbation theory traditional tool dynamics nonlinear mechanics wide variety dynamic phenomena slow fast modes show singular perturbations present lyapunov functions competitive neural networks problems sense apply paper results valuable analysis tool dynamics laterally inhibited networks shown lyapunov function system obtained weighted lyapunov functions lower order systems socalled reduced systems assuming systems asymptotically stable lyapunov function conditions derived guarantee sufficiently small parameter asymptotic stability perturbed system established means lyapunov function composed weighted lyapunov functions reduced systems adopting notations perturbed system assume origin unique equilibrium point unique solution reduced system defined setting obtain assuming unique root reduced system rewritten system defined time scale vector treated fixed unknown parameter takes values establish stability properties perturbed system small reduced system system lyapunov functions system shown mild assumptions sufficiently small weighted lyapunov functions reduced system lyapunov function perturbed system assumptions 
stated reduced system lyapunov function function vanishes condition guarantees asymptotically stable equilibrium point reduced system symbol closed sphere centered defined system lyapunov function function vanishes condition guarantees asymptotically stable equilibrium point system inequalities hold constants nonnegative inequalities determine interaction slow fast variables basically smoothness requirements remarks stability criterion stated theorem suppose conditions hold positive number positive number origin asymptotically stable equilibrium point lyapunov function global neural time constant equation determine lyapunov functions system system mentioned global lyapunov function competitive neural network activation dynamics constraints system equation contribution considered fixed unknown parameter lyapunov functions competitive neural networks system coupled dynamics design stable competitive neural networks competitive neural networks learning rules moving equilibria learning process concept asymptotic stability derived matrix theory capture phenomenon design section competitive neural network store desired pattern stable equilibrium theoretical implications illustrated neuron network nonlinearity linear function equations system system time msec figure time histories neural network origin equilibrium point states nonnegative constants interesting implications results interpreted achieve stable equilibrium point negative contribution external stimulus term excitatory inhibitory contribution neurons time constant neuron evolution trajectories states neuron system shown figure states exhibit oscillation expected equilibrium point states reach monotonically equilibrium point pictures equilibrium point reached msec choosing obtain formula maximum conclusions presented paper lyapunov function analyzing stability equilibrium points competitive neural networks fast slow dynamics global stability analysis method interpreting neural networks nonlinear perturbed 
systems equilibrium point constrained neighborhood technique monotonically increasing nonlinearity symmetric lateral inhibition matrix learning rule deterministic hebbian method upper bound perturbation figure time histories neural network origin equilibrium point states parameter estimation maximal positive neural practical implication theoretical problem design competitive neural network store desired pattern stable equilibrium references amari competitive cooperative aspects dynamics neural excitation selforganization competition cooperation neural networks amari field theory selforganizing neural nets ieee transactions systems machines communication cohen grossberg absolute stability global pattern formation parallel memory storage competitive neural networks ieee transactions systems cybernetics grossberg adaptive pattern classification universal recording biological cybernetics hebb organization behavior wiley verlag lyapunov functions perturbed systems ieee transactions automatic control june
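The time histories discussed in the competitive network paper above can be reproduced qualitatively with a short forward-Euler simulation. The network equations themselves did not survive in the text, so the form used here is an assumption based on the verbal description: a fast activity equation with a decay term, a symmetric lateral inhibition term, and an external stimulus term weighted by a slowly varying synaptic modification state. All symbol names (`a`, `D`, `B`, `eps`, `s`) and the parameter values in the usage note are hypothetical.

```python
import numpy as np

def simulate_fast_slow(a, D, B, eps, x0, s0, f=lambda x: x,
                       dt=1e-3, steps=20000):
    """Forward-Euler integration of an assumed two-time-scale network.

    Assumed form (not taken verbatim from the paper):
        eps * dx/dt = -a * x + D @ f(x) + B * s    (fast activity)
              ds/dt = -s + f(x)                    (slow synaptic state)
    D is taken to be symmetric, so D @ f(x) sums lateral contributions.
    """
    x = np.array(x0, dtype=float)
    s = np.array(s0, dtype=float)
    history = np.empty((steps + 1, 2, x.size))
    history[0] = x, s
    for k in range(steps):
        fx = f(x)
        dx = (-a * x + D @ fx + B * s) / eps
        ds = -s + fx
        x = x + dt * dx
        s = s + dt * ds
        history[k + 1] = x, s
    return history
```

For example, with a linear nonlinearity, `a = [1, 1]`, `D = [[-0.5, -0.2], [-0.2, -0.5]]`, `B = [0.5, 0.5]` and `eps = 0.1`, both state variables decay to the origin, which is the kind of behaviour the stability analysis is meant to certify. The step size must stay small relative to `eps` for the Euler scheme to remain stable.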
3 learning combining gradient descent john platt road suite jose abstract created radial basis function network computational unit unusual pattern presented network network learns allocating units adjusting parameters existing units network performs poorly presented pattern unit allocated response presented pattern network performs presented pattern network parameters updated standard gradient descent predicting mackeyglass chaotic time series network learns faster backpropagation comparable number synapses introduction networks perform function interpolation tend fall categories networks gradient descent learning backpropagation constructive networks learning knearest neighbors networks gradient descent learning tend form compact representations learning cycles find representation networks inputs exposed examples grow linearly training size network presented compromise descent gradient descent easy input vectors hard input vectors network performs input vector input vector close stored vector network adjusts parameters gradient descent input vector output vector allocating unit storage inputoutput pair means pair immediately improve performance system information gradient descent network called network units response localized input space unit nonlocal response undergo gradient descent nonzero output large fraction training data constructive network automatically adjusts number units reflect complexity function interpolated networks units case network poorly case network generalizes poorly parzen windows knearest neighbors require number stored patterns grow linearly number presented patterns number stored patterns grows eventually reaches maximum previous work previous workers networks localized basis functions broomhead lowe moody darken poggio girosi moody extended work incorporating table lookup moody table network values table nonzero entry table activated presence nonzero input probability adjusts centers gaussian units based 
error output poggio girosi networks centers highdimensional grid broomhead lowe moody networks unsupervised clustering center placement moody darken generate larger networks move centers increase accuracy previous workers created function interpolation networks allocate fewer units size training cascadecorrelation fahlman lebiere sonn tenorio mars friedman construct networks adding additional units algorithms work algorithm improves algorithms making addition unit simple simple algebra find parameters unit cascade correlation mars gradient descent sonn simulated annealing algorithm section describes network consists network strategy allocating units learning rule refining network network twolayer network layer consists platt units respond local region space input values layer linearly outputs units creates function approximates inputoutput mapping entire space simple function implements locally tuned unit gaussian continuous polynomial approximation speed algorithm loss network accuracy chosen empirically make gaussian output network outputs weighted synaptic strength global polynomial represent information local parts space polynomial represents global information term thought bump added subtracted polynomial term yield desired function linear term function strong linear component results section mackeyglass equation predicted constant term learning algorithm network starts blank patterns stored patterns presented network chooses store point network current state reflects patterns stored previously allocate unit pattern unit allocated network output equal desired output index unit peak response newly allocated unit memorized input vector linear synapses layer difference output network output learning combining gradient descent width response unit proportional distance nearest stored vector input vector overlap factor grows larger responses units overlap condition inputoutput pair memorized input existing centers difference desired output output network large 
typically desired accuracy output network errors larger immediately corrected allocation unit errors smaller gradually gradient descent distance scale resolution network fitting input presentation learning starts largest length scale interest typically size entire input space nonzero probability density distance reaches smallest length scale interest network average features smaller function decay constant system creates coarse representation function representation allocating units smaller smaller widths finally system learned entire function desired accuracy length scale stops allocating units altogether condition creating compact network condition network allocate units gradient descent correct small errors condition units allocated order represent features allocating units eventually represents desired function closely network trained fewer units needed accuracy firstlayer synapses synapses parameters global polynomial adjusted decrease error widrow hoff gradient descent synapses decrease error unit allocated platt addition adjust centers responses units decrease error equation derived gradient descent equation empirically equa tion works polynomial approximation results application interpolating predict complex time series test case chaotic time series generated nonlinear algebraic differential equation series shortrange time coherence long term prediction difficult tested chaotic time series created mackeyglass equation trained network predict values inputs network tested learning modes offline learning limited amount data online learning large amount data mackeyglass equation learned offline workers back propagation algorithm lapedes farber radial basis functions moody darken predict mackeyglass equations parameters learning epochs reached epochs simulated cases figure shows efficiency learning algorithms smallest accurate algorithms lower left optimized size network weights backpropagation accurate efficiency roughly backpropagation requires computation 
takes approximately minutes time reach accuracy listed figure backpropagation approximately minutes time mackeyglass equation learned online techniques moody online parameters reached input presentations table compares online error versus size network bspline moody personal communication cases algorithm similar accuracy number units allocated factor smaller detailed results mackeyglass equation number weights bspline standard kmeans backpropagation figure error test versus size network backpropagation stores prediction function compactly accurately takes large amount computation form compact representation compact accurate backpropagation computation form representation table comparison method number units normalized error bspline level hierarchy bspline levels hierarchy conclusions desirable attributes network learns learn quickly learn accurately form compact representation formation compact representation important networks implemented hardware silicon area compact representation important statistical reasons network parameters overfit data generalize poorly previous network algorithms learned quickly expense compact representation formed compact representation computation network find compact representation reasonable amount computation carver mead fernando pineda comments paper special john moody provided comments paper provided data references broomhead lowe multivariable function interpolation adaptive networks complex systems fahlman lebiere cascadecorrelation learning architecture advances neural information processing systems touretzky morgankaufmann mateo friedman multivariate adaptive regression splines department statistics stanford university tech report lapedes farber nonlinear signal processing neural networks prediction system modeling technical report alamos national laboratory alamos moody darken learning localized receptive fields proceedings connectionist models summer school touretzky hinton 
sejnowski morgankaufmann mateo moody darken fast learning networks locallytuned processing units neural computation moody fast learning multiresolution hierarchies advances neural information processing systems touretzky morgan kaufmann mateo platt network function interpolation neural computation poggio girosi regularization algorithms learning equivalent multilayer networks science radial basis functions multivariable interpolation review algorithms approximation mason press oxford tenorio selforganizing neural networks identification problem advances neural information processing systems touretzky morgankaufmann mateo widrow hoff adaptive switching circuits convention record york
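The learning rule described in the paper above alternates between two behaviours: store an input-output pair as a new unit when the input is far from every stored center and the output error is large, and otherwise correct small errors by gradient descent. The following is a minimal sketch of that decision, assuming plain Gaussian units, a scalar output, and only a constant term in place of the global polynomial. The class name, parameter names, and default values are hypothetical and were not taken from the paper.

```python
import numpy as np

class AllocatingNetwork:
    """Hypothetical sketch: Gaussian units plus a constant output term."""

    def __init__(self, dim, eps=0.05, delta_max=1.0, delta_min=0.1,
                 tau=50.0, kappa=0.9, lr=0.05):
        self.centers = np.empty((0, dim))
        self.widths = np.empty(0)
        self.heights = np.empty(0)
        self.bias = 0.0
        self.eps, self.delta_max, self.delta_min = eps, delta_max, delta_min
        self.tau, self.kappa, self.lr = tau, kappa, lr
        self.t = 0  # number of presentations seen so far

    def _responses(self, x):
        d2 = np.sum((self.centers - x) ** 2, axis=1)
        return np.exp(-d2 / self.widths ** 2)

    def predict(self, x):
        x = np.asarray(x, dtype=float)
        return self.bias + self._responses(x) @ self.heights

    def observe(self, x, target):
        """Present one pair; return True if a new unit was allocated."""
        x = np.asarray(x, dtype=float)
        err = target - self.predict(x)
        # Distance scale shrinks from delta_max toward delta_min over time.
        delta = max(self.delta_max * np.exp(-self.t / self.tau), self.delta_min)
        self.t += 1
        if len(self.widths):
            nearest = np.min(np.linalg.norm(self.centers - x, axis=1))
        else:
            nearest = np.inf
        if nearest > delta and abs(err) > self.eps:
            # Memorize: center on the input, height equal to the current
            # error, width proportional to the distance to the nearest center.
            width = self.kappa * (nearest if np.isfinite(nearest) else delta)
            self.centers = np.vstack([self.centers, x])
            self.widths = np.append(self.widths, width)
            self.heights = np.append(self.heights, err)
            return True
        # Otherwise take one gradient-descent step on the squared error.
        r = self._responses(x)
        self.bias += self.lr * err
        if len(r):
            scale = 2 * self.lr * err * self.heights * r / self.widths ** 2
            self.centers += scale[:, None] * (x - self.centers)
            self.heights += self.lr * err * r
        return False
```

Because a newly allocated unit has unit response at its own center and a height equal to the current error, the network output at a memorized input equals the desired output immediately after allocation, which is the property the text relies on.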
5 transfer neural networks pratt department mathematical computer sciences colorado school golden abstract previously introduced idea neural network transfer learning target problem weights obtained network trained related source task present algorithm called transfer information measure estimate utility hyperplanes defined source weights target network transferred weight magnitudes experiments demonstrate target networks initialized learn significantly faster networks initialized randomly introduction neural networks trained scratch relying training data guidance networks trained tasks reasonable seek methods avoid wheel build previously trained networks results speech recognition network trained american english speak application speakers british accent tasks larger distribution english speakers related exploited speed learning british network compared weights randomly initialized previously introduced question trained neural networks transfer neural networks pratt called problem idea transfer strong roots psychology discussed standard paradigm neurobiology synapses ways formulate transfer problem retaining performance source task important problem called sequential learning explored authors cohen paradigm assumes source task performance important source task training data subset target training data method viewed addressing sequential learning transfer knowledge inserted entry points backpropagation network pratt focus changing networks initial weights studies change aspects objective function thrun mitchell transfer methods backpropagation target task training formulation degrade worst case source task relevance backpropagation training target task randomly initialized weights alternative approach studies explored literal transfer backpropagation networks final weights training source task initial conditions target training martin studies shown networks demonstrate worse performance literal transfer randomly initialized paper describes transfer algorithm 
overcomes problems literal transfer achieves asymptotic randomly initialized networks requires substantially fewer training updates superior literal transfer source network target task analysis literal transfer mentioned studies shown networks initialized literal transfer give worse asymptotic performance randomly initialized networks understand situation subset source network inputtohidden layer hyperplanes target problem figure observed hyperplanes initialized source network training dont shift initial positions fact dont separate target training data weights defining hyper planes high magnitudes sontag figure shows simulation situation hyperplane high magnitude source network learning analysis backpropagation weight update equations reveals high source weight magnitudes backpropagation learning target task network visualization explored paper author anonymous type pratt source training data target training data hyperplanes retained hyperplanes move feature figure problem illustrating source target tasks identical target task shifted axis represented training data shown shift source hyperplanes helpful separating data target task equation scaled relative weight magnitudes weight update equa tion factor units activation small large weights analysis simple solution problem literal transfer uniformly lower weight observed hyperplanes separating positions move high weight magnitudes address prob lems hyperplanes defined weights hyperplanes receive magnitudes implement method metric evaluating hyperplane utility evaluating classifier components metric evaluating hyperplanes decision tree induction quinlan training data hyperplane crosses function returns indicating amount hyperplane helps separate data classes formula decision surface multiclass prob number patterns depending side hyperplane pattern falls indexes classes count class patterns side count patterns side total number patterns class algorithm algorithm shown figure inputs target training data weights source 
network parameters outputs modified initializing training target task figure shows problem figure modifies weights defining source hyperplane proportional transfer neural networks literal feature feature feature feature feature feature feature feature figure hyperplane movement speed literal transfer compared image figure shows hyperplanes implemented weights epoch training hidden unit hyperplane solid line dotted line hyperplane shown dashed line note fixed place high magnitude learning slow taking epochs converge note small magnitude allowing flexible training data separated epoch randomly initialized network problem takes epochs pratt input source network weights target training data parameters cutoff factor factor output initial weights target network assuming topology source network method source network hidden unit compare hyperplane defined incoming weights target training data calculating values largest result highest magnitude ratio weights defining hyperplane reset weights hyperplane randomly uniformly scale hyperplane weights magnitude position remaining hidden unit weight defining hyperplane target network source weight hiddentooutput target network weights randomly figure transfer algorithm input based idea initial magnitude target hyperplane constant proportionally magnitude source network hyperplane source hyperplane target training data assume simple relationship holds range values parameter determines cutoff relationship source hyperplanes receive magnitudes hyperplanes effectively equivalent randomly initialized network parameter motivated empirical experiments multiplicative scaling adequate determine source task times small number epochs values chose values yielded average total squared errors epochs local hill climbing average space decide move space weights networks hiddentooutput layer extension work showing literal transfer weights effective empirical results evaluated tasks speaker transfer recognition task subset task single male task 
transfer heart disease diagnosis problem swiss patients transfer task patients california swiss patients transfer subset pattern recognition examples transfer neural networks subset chess problems chess note chess tasks effectively address sequential learning problem long source data subset target data target network build previous results compared randomly initialized networks target task measured generalization performance conditions cross validation initial conditions target task resulting runs conditions tasks empirical methodology controlled carefully initial conditions hidden unit count backpropagation learning rate momentum scenarios evaluation practical situations speed learning limited amount computer time detecting networks performance reached criterion case speedup method superior baseline large proportion epochs training probability stop period significant superiority high stop epoch method significantly justifies baseline resulting network situation detecting performance good application contrast situation network shorter time baseline network reaches criterion faster case number epochs significant important speed achieves criterion results evaluate networks scenario tested statistical signif level initialized networks training epoch found asymptotic performance scores random networks superior training period figure shows number weight updates significant difference found tasks found networks required fewer epochs reach criterion performance score test found significantly epoch methods measured number epochs required reach level number weight updates required randomly initialized networks reach criterion shown figure note axis logarithmic million weight updates saved random initialization problem results criteria showed fast random initialization task tests tested literal networks transfer tasks found unlike literal reached pratt time epoch difference random time required train criterion heart task heart heart task figure summary empirical results 
asymptotic performance scores randomly initialized networks literal networks learned slower tasks results justify complicated method literal transfer evaluated source networks directly target tasks backpropagation training target training data scores significantly substantially worse random networks result transfer scenarios chose evaluation nontrivial conclusion algorithm transfer neural networks demonstrated substantial significant learning speed improvement randomly initialized networks tasks studied learning speed task displayed worse asymptotic performance randomly initialized network shown superior literal transfer simply source network target task acknowledgements author indebted john smith martin valuable comments paper jack contribution research program pratt details transfer neural networks references online training algorithm overcome catastrophic forgetting intelligence systems artificial neural networks volume pages american society mechanical engineers press sontag sontag speed backpropagation algorithm proceedings joint conference neural networks washington volume pages ieee january martin martin effects learning hopfield back propagation nets technical report computer technology corporation cohen michael neal cohen catastrophic interference connectionist networks sequential learning problem psychology learning motivation john empirical comparison selection measures decision tree induction machine learning network approach learning learning intelligence engineering systems artificial neural networks volume pages american mechanical engineers press pratt pratt jack direct transfer learned information neural networks proceedings ninth national conference artificial intelligence pages pratt pratt experiments transfer knowledge neural networks hanson rivest editors computational learning theory natural learning systems constraints press pratt pratt transfer information inductive learners editors neural networks theory applications academic press quinlan 
quinlan learning efficient classification procedures application chess games machine learning pages palo alto publishing company adaptive generalisation transfer knowledge paper center connection science university thrun mitchell sebastian thrun mitchell inductive neural network explanationbased learning giles hanson cowan editors advances neural information processing systems morgan kaufmann publishers mateo
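The transfer algorithm described in this paper (score each source hidden unit hyperplane against the target training data, reset weakly scoring hyperplanes to small random weights, and rescale the remaining ones while keeping their position) can be illustrated with a minimal sketch. Several parts are assumptions and not taken from the text: the score is an information-gain measure in the style of the decision-tree references cited above, the names `cutoff` and `scale` stand in for the two parameters the text mentions without giving values, and the reset range of 0.5 is arbitrary.

```python
import numpy as np

def entropy(labels):
    """Shannon entropy (bits) of a label array."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def split_information(w, b, X, y):
    """Information gained about y by splitting X with the hyperplane w.x + b = 0.
    Assumed scoring function; the source does not spell out its measure here."""
    side = X @ w + b > 0
    gain = entropy(y)
    for mask in (side, ~side):
        if mask.any():
            gain -= mask.mean() * entropy(y[mask])
    return gain

def transfer_hidden_weights(W, b, X, y, cutoff=0.1, scale=5.0, seed=0):
    """Return initial input-to-hidden weights for the target network.

    W: (hidden, inputs) source weights, b: (hidden,) source biases.
    cutoff and scale are hypothetical parameter names and values.
    """
    rng = np.random.default_rng(seed)
    W_new = W.astype(float).copy()
    b_new = b.astype(float).copy()
    for j in range(W.shape[0]):
        score = split_information(W[j], b[j], X, y)
        norm = np.linalg.norm(W[j])
        if score < cutoff or norm == 0.0:
            # Unhelpful hyperplane: reset to small uniform random weights.
            W_new[j] = rng.uniform(-0.5, 0.5, size=W.shape[1])
            b_new[j] = rng.uniform(-0.5, 0.5)
        else:
            # Helpful hyperplane: same position, magnitude proportional to score.
            factor = scale * score / norm
            W_new[j] *= factor
            b_new[j] *= factor
    return W_new, b_new
```

Scaling the weight vector and the bias by the same positive factor leaves the hyperplane where it was and changes only its magnitude, which is the property the text relies on. Hidden-to-output weights would be initialized randomly, as the algorithm description states.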
10 efficiency robustness natural gradient descent learning rule howard department computer science oregon graduate institute portland amari information synthesis brain science institute japan abstract inverse fisher information matrix gradient descent algorithm train singlelayer multilayer perceptrons discovered scheme represent fisher information matrix stochastic multilayer perceptron based scheme designed algorithm compute natural gradient input dimension larger number hidden neurons complexity algorithm order confirmed simulations natural gradient descent learning rule efficient robust introduction inverse fisher information matrix required find lower bound analyze performance unbiased estimator needed natural gradient learning framework amari design statistically efficient algorithms estimating parameters general training neural networks paper assume stochastic model multilayer perceptrons riemannian parameter space fisher information matrix metric tensor apply natural gradient learning rule train singlelayer multilayer perceptrons main difficulty encountered compute inverse fisher information matrix large dimensions input dimension high exploring structure fisher information matrix inverse design fast algorithm lower complexity implement natural gradient learning algorithm yang stochastic multilayer perceptron assume model stochastic multilayer perceptron denotes transpose gaussian random variable differentiable output function hidden neurons assume multilayer network ndimensional input hidden neurons dimensional output denote weight vector output neuron weight vector hidden neuron vector thresholds hidden neurons matrix formed column weight vectors rewritten scalar function operates component vector joint probability density function input output define loss function includes parameters estimated fisher information matrix defined inverse inequality unbiased estimator true parameter online estimator based independent examples drawn probability inequality 
online estimator natural gradient learning parameter space divergence points kullbackleibler divergence points close quadratic form regarded square length depends parameter space regarded riemannian space local distance defined fisher information matrix plays role riemannian metric tensor shown steepest descent direction loss function riemannian space natural gradient descent method decrease loss function updating parameter vector direction multiplying gradient converted form consistent differential form loss function proved depend unknown online learning algorithms based gradient natural gradient learning rates negative loglikelihood function chosen loss function natural gradient descent algorithm fisher efficient online estimator amari asymptotic variance driven satisfies square error main difficulty implementing natural gradient descent algorithm compute natural gradient online overcome difficulty studied structure matrix proposed efficient scheme represent matrix briefly describe scheme partition denote proved blocks divided classes block linear combination matrices block matrix column combination coefficients combinations integrals respect multivariate gaussian distribution yang block matrix entries integrals respect detail expressions integrals techniques saad solla find analytic expressions integrals dimension input dimension larger number hidden neurons scheme space storing large matrix reduced gave fast algorithm compute natural gradient time complexity trick make structure matrix simulation section give simulation results demonstrate natural gradient descent algorithm efficient robust singlelayer perceptron assume inputs singlelayer perceptron online gradient descent natural algorithms learning rate schedules defined schedule proposed darken moody note search phase converge phase learning rate function search phase weaker converge phase small large decreases randomly choose vector teacher 
network choose parameters selected trial error optimize performance natural methods noise level training examples generated unknown algorithms weight vectors driven equations respectively functions natural obtain lower equation vector denote bound natural levels yang amari figure performance natural fixed training examples shown figure teacher signal nonstationary simulations show natural algorithm reaches figure shows natural algorithm robust algorithm change learning rate schedule performance algorithm constant learning rate schedule optimal contrary natural algorithm forms interval figure shows natural algorithm breaks larger critical means weak converge phase learning rate schedule multilayer perceptron simple multilayer perceptron input hidden neurons problem train committee machine based generated stochastic committee machine assume weight vector decrease dimension parameter space natural iteration figure natural parameter space assume true parameters symmetry true parameters computed algorithm natural algorithm errors measured simulation initial estimate start algorithm iterations estimate obtained algorithm iteration initial estimate natural algorithm algorithm iterations noise level independent runs conducted obtain errors define root square errors based independent runs errors computed compared figure learning schedule algorithm learning rate natural algorithm simply annealing rate conclusions natural gradient descent learning rule statistically efficient train adaptive system complexity learning rule depends architecture learning machine main difficulty implementing learning rule compute inverse fisher information matrix large dimensions multilayer perceptron shown efficient scheme represent fisher information matrix based space storing large matrix reduced shown algorithm compute natural gradient taking advantage structure inverse 
fisher information matrix found complexity computing natural gradient input dimension larger number hidden neurons simulation results confirmed fast convergence statistical efficiency natural gradient descent learning rule verified learning rule robust noise levels training examples parameters learning rate schedules references amari natural gradient works efficiently learning accepted neural computation amari neural learning structured parameter spaces natural riemannian gradient advances neural information processing systems mozer jordan petsche press cambridge pages darken moody faster stochastic gradient search neural information processing systems moody hanson lippmann morgan kaufmann mateo pages saad solla online learning soft committee machines physical review yang amari natural gradient descent training multilayer perceptrons submitted ieee neural networks
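The natural gradient update for the singlelayer perceptron discussed in this paper (multiply the ordinary gradient by the inverse Fisher information matrix, with a learning rate that decays over time) can be illustrated with a minimal sketch. It makes several assumptions that are not from the text: the output function is `tanh`, the Fisher matrix is estimated online by an exponential moving average instead of the paper's closed-form representation scheme, the noise variance is absorbed into the learning rate, and the schedule `lr0 / (1 + t / tau)` is a simple search-then-converge style form with hypothetical constants.

```python
import numpy as np

def natural_gradient_fit(X, y, lr0=0.1, tau=200.0, ema=0.01):
    """Online natural gradient descent for the model y = tanh(w . x) + noise.

    X: (samples, inputs) array, y: (samples,) array.
    lr0, tau and ema are illustrative values, not taken from the source.
    """
    n = X.shape[1]
    w = np.zeros(n)
    # Running estimate of the Fisher matrix, up to the 1/sigma^2 factor.
    G = np.eye(n)
    for t, (x, target) in enumerate(zip(X, y)):
        out = np.tanh(w @ x)
        dphi = 1.0 - out ** 2                  # derivative of tanh at w . x
        grad = -(target - out) * dphi * x      # gradient of 0.5 * (y - out)^2
        # For Gaussian output noise the Fisher matrix is E[dphi^2 x x^T] / sigma^2.
        G = (1.0 - ema) * G + ema * dphi ** 2 * np.outer(x, x)
        lr = lr0 / (1.0 + t / tau)             # decaying learning rate
        # Natural gradient step: solve G d = grad instead of forming the inverse.
        w = w - lr * np.linalg.solve(G, grad)
    return w
```

Solving the linear system costs on the order of the cube of the input dimension per step, which is exactly the expense the paper's structured representation of the Fisher matrix is designed to avoid; the sketch trades that efficiency for brevity.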
