######################################################################
######################################################################
##                                                                  ##
##                              dynet                               ##
##                                                                  ##
##                   a recurrent neural network                     ##
##                 for modelling dynamical systems                  ##
##                                                                  ##
##                                by                                ##
##                                                                  ##
##                        Coryn Bailer-Jones                        ##
##                                                                  ##
##                             22/05/98                             ##
##                                                                  ##
##                  email: calj@mpia-hd.mpg.de                      ##
##              www: http://wol.ra.phy.cam.ac.uk/calj/              ##
##                                                                  ##
##         see the README file for disclaimer and warranty          ##
##        see the dynet_manual file for operational details         ##
##                                                                  ##
##        This file is copyright 1998 by C.A.L. Bailer-Jones        ##
##                                                                  ##
######################################################################
######################################################################

FILE:           dynet_manual
DESCRIPTION:    operations manual for dynet
AUTHOR:         Coryn Bailer-Jones
LAST MOD DATE:  26/06/98

######################################################################
######################################################################
##                                                                  ##
##                      dynet operations manual                     ##
##                                                                  ##
######################################################################
######################################################################

This file provides the information required to use the dynet
software. It assumes an understanding of the principles behind dynet.
This manual is relevant to version 1.19 of the software.

######################################################################
#                           What dynet is                            #
######################################################################

dynet is a recurrent neural network for modelling dynamical systems
by means of discrete-time measurements of the temporal patterns
produced by the dynamical system. Its learning routine is fully
recurrent, and can be viewed as performing temporal interpolation of
one or more temporal patterns.

######################################################################
#                    Other sources of information                    #
######################################################################

A paper, Bailer-Jones & MacKay 1998, which explains the model in
detail, has been submitted to a journal and will be available on my
web page once accepted. In the meantime, a short description is
available in the paper "Static and Dynamic Modelling of Materials
Forging", Bailer-Jones et al. 1998, available from
http://wol.ra.phy.cam.ac.uk/calj/ajiips98.html
I strongly recommend you read these papers.

The source code for dynet is reasonably well documented. The
conjugate gradient optimizer "macopt" has its own documentation,
which can be obtained from
http://wol.ra.phy.cam.ac.uk/mackay/c/macopt.html

Further enquiries can be addressed to the author at the email address
at the head of this file. Related publications and other information
are available from the dynet web page
http://wol.ra.phy.cam.ac.uk/calj/dynet.html

######################################################################
#                          The dynet files                           #
######################################################################

Makefile      the makefile

README        disclaimer and warranty. Please read this before you
              proceed.

ansi/         a directory which contains macopt (David MacKay's
              conjugate gradient optimizer) and its ancillary files.
              This is the standard macopt distribution and has not
              been modified in any way.
dynet         the dynet executable

dynet.c       main dynet program

dynet.h       header file

dynet_manual  this file

dynetsubs.c   collection of dynet subroutines (the division of
              subroutines between this file and dynet.c is not
              entirely clear any more, although I'm not sure it ever
              was)

ran1.c        random number generator

syn/          a directory containing the data files for a synthetic
              problem (see the section "The synthetic problem" below)

syn.26.spec   the specfile for the synthetic problem

######################################################################
#                          How to run dynet                          #
######################################################################

First unpack the tar file, dynet.tar. Type

  tar xvf dynet.tar

This will create a directory called "dynet" containing all of the
files listed above.

dynet is written in ANSI C and was developed initially on a SUN
platform and then continued under Linux. The executable included in
this release is an i386 binary, i.e. a Linux program. To get it to
run under any other flavour of UNIX you will need to compile it. To
do this all you need to do, in theory, is type

  make dynet

but I can't guarantee that this will work; you may well need to
adjust the Makefile. I may well port dynet to other UNIX platforms,
so it would be worth getting in touch if you have problems. The make
will give a few warnings, but these are not important and can be
ignored. Note that in one place dynet issues a UNIX-specific command
via the C function "system" (although this only prints the date, so
it's hardly crucial).

To run dynet, type

  dynet specfile

where specfile is the only command line argument which dynet reads.
Typing "dynet -v" will give the version number only. See the section
"The synthetic problem" for an example application.

######################################################################
#                            The specfile                            #
######################################################################

The specfile contains all of the relevant information on the temporal
pattern (tp) files, the network architecture and the training
details. specfiles should be given the ".spec" suffix.

The specfile is read by searching for strings, such as
"train_network?_(yes/no)"; these strings must therefore not be
changed, or they will be ignored. The relevant input for each string
must be the next item on the same line, e.g.

  train_network?_(yes/no) yes

The items in parentheses, "(yes/no)", indicate the choices available.
Note that any line in the specfile preceded by a "#" symbol will be
ignored. Inputs which disobey the required type or range will be
flagged as errors by dynet. The only exception is when real values
are specified instead of integers, in which case only the integer
part of the number will be used (ANSI C %f to %d conversion). The
order of the input strings is arbitrary, although there are a few
restrictions, most of which are obvious and are indicated below.

The various input strings in the specfile are described below, along
with the possible choices (given in round parentheses if not already
part of the relevant string) and the default value (in square
brackets). Note that some strings do not have default values: their
values must be specified or dynet will exit with an error message.
Note also that a few of the options have not yet been implemented or
fully de-bugged: these are flagged below.
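Before the individual strings are described, here is a minimal,
hypothetical specfile sketch to show the layout. The strings are
genuine (all are documented below), but the values, the file names
(example.wt, train.tpin1, etc.) and the particular selection of
strings are illustrative only; most strings with defaults are
omitted.

  # example.spec - illustrative specfile sketch
  verbosity_level_(0/1/2/3/4) 2
  train_network?_(yes/no) yes
  apply_network?_(yes/no) yes
  NET:number_of_state_variables_(V) 2
  NET:number_of_measured_state_variables_(Vm) 2
  NET:number_of_external_inputs_(X) 2
  NET:number_of_hidden_nodes_(H) 8
  NET:data_scaling_(none/var/maxmin/netsize) var
  TRN:output_weight_file example.wt
  TRN:number_of_temporal_pattern_files 1
  train.tpin1
  MAC:convergence_tolerance 0.001
  MAC:maximum_number_of_iterations 500
  APP:plot_file_name example.dat
  APP:number_of_temporal_pattern_files 1
  test.tpin1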
Most of the input strings are prefixed by a three letter code
indicating the part of the program to which they are relevant:

  NET  network architecture
  TRN  training of network
  GRD  gradient descent
  MAC  macopt (conjugate gradient optimizer)
  APP  application of network

Note that in earlier versions of this manual the state variables were
often referred to as "recurrent inputs". Although I now use the term
"state variable", there may still be some reference to "recurrent
inputs" in the program source code. The terms are synonymous.

The possible entries in the specfile are now listed and discussed.

verbosity_level_(0/1/2/3/4) [2]
 - Amount of output from the program, ranging from 0 (nothing apart
   from error messages) to 4 (lots of diagnostic stuff). Level 2 is
   appropriate for normal running.

train_network?_(yes/no) [no]
 - Do you want to learn the weights from a given set of data?

apply_network?_(yes/no) [no]
 - Do you want to apply the network to a set of data?

NET:number_of_state_variables_(V) (+ve integer) [no default]
 - Total number of state variables.

NET:number_of_measured_state_variables_(Vm) (+ve integer) [no default]
 - Number of measured state variables. Must have V >= Vm. If
   V-Vm > 0, then V-Vm is the number of "unmeasured" state variables
   (see BJM98).

NET:number_of_external_inputs_(X) (integer >= 0) [no default]
 - Number of external inputs (excludes the input-hidden bias, which
   is automatically included).

NET:number_of_hidden_nodes_(H) (integer >= 0) [no default]
 - Number of hidden nodes in the one and only hidden layer (excludes
   the hidden-output bias, which is automatically included). Note
   that you can set H=0, although I have no idea why you may want to
   do this.

NET:data_scaling_(none/var/maxmin/netsize) [var]
 - The external inputs and state variables can, and should, be
   scaled. "var" separately scales each external input and state
   variable to have zero mean and unit standard deviation (this is
   the recommended option). "maxmin" is not yet implemented. The
   "netsize" option ensures that the summed input to the hidden layer
   does not grow with the number of inputs. The input-hidden transfer
   function is

     H = tanh(Hlam*S)

   where S is the sum over the products of each input and its
   associated input-hidden weight. If the "netsize" option is used,
   Hlam is set to

     Hlam = 1/sqrt(Xsize+1+Vsize)

   This is also included as part of the "var" scaling option.

NET:input_weight_file (file name) [no default]
 - If dynet has already been trained and a weight file produced, that
   weight file can be read in using this option. This is used to
   continue training from a given set of weights, or if you just want
   to use these weights to evaluate state variable sequences for a
   given sequence of external inputs. Often you will train and apply
   in a single run, in which case this field does not need to be set.
   Weight files should be given the suffix ".wt". See the section
   below for details of the file format.

TRN:output_weight_file (file name) [dynet.wt]
 - The default weight file name is only there in case you forget to
   specify it yourself. If you just apply the network with a set of
   weights which you read in (using "NET:input_weight_file"), then
   this option will be ignored, i.e. the weight file will not be
   re-written. Weight files should be given the suffix ".wt". See the
   section below for details of the file format.

TRN:form_of_weight_init_(uniform/gaussian) [uniform]
 - Initial weights for the network are drawn from a uniform or a
   Gaussian distribution. (Actually, they can only be drawn from a
   uniform distribution, as the Gaussian option has not been
   implemented.)
TRN:initial_weight_range (real value) [0.1]
 - Scale of the random distribution from which the initial weights
   are drawn. If the "uniform" distribution has been chosen, it will
   range from -wtrng to +wtrng, where wtrng is the value specified
   here.

TRN:random_number_seed (integer value) [731]
 - Used to seed the selection of the initial weights. (A small
   initialisation sketch follows the TRN entries below.)

TRN:optimization_method_(grd/macopt) [macopt]
 - The weights can be optimized using the gradient descent method
   (grd) or a conjugate gradient optimizer (macopt, written by David
   MacKay). Both are implemented, but grd has not been fully tested.

TRN:update_method_(1/3/4) [4]
 - With gradient descent, the weights can be updated in one of three
   ways:
     1. after each epoch of each temporal pattern file (this is like
        on-line learning, or Real Time Recurrent Learning extended to
        multiple temporal patterns)
     3. after all epochs of each temporal pattern file
     4. after all epochs of all temporal pattern files (total batch)
   Although these options could also apply with macopt, only method
   no. 4 has been implemented for it.

TRN:weight_decay_(none/default/list) [none]
 - Weight decay can be used to regularize the training procedure.
   1/sqrt(alpha) can be thought of as the standard deviation of the
   Gaussian prior over the weights (with zero mean). The alpha
   parameters of the weight decay (see BJM98) can be set using the
   "list" option, or the default values can be used. Note that the
   current version of dynet cannot learn the optimum alpha values
   from the data. If the list option is used, the following four
   lines must be specified:
     TRN:alpha_VH (real >= 0)
     TRN:alpha_XH (real >= 0)
     TRN:alpha_bH (real >= 0)
     TRN:alpha_HY (real >= 0)
 - These are the alpha parameters for the state variable to hidden,
   external input to hidden, input bias to hidden, and hidden to
   output weights respectively. alpha values can be set to zero, e.g.
   if you only want to apply weight decay to some sets of weights.

TRN:use_beta_parameters?_(yes/no) [no]
 - beta is the coefficient of the error term for each state variable.
   1/sqrt(beta) can be considered as the standard deviation of the
   noise in the state variables. Note that if you use scaling, then
   beta is on the scale of the scaled variables, not the raw values
   in the temporal pattern files. If this option is set to "no", all
   of the beta values are set to the default value, BETADEF, which is
   1. Otherwise, the user must specify the beta values for the V
   state variables on the next V lines. Thus if V=2 the next two
   lines would be
     TRN:beta (real >= 0)
     TRN:beta (real >= 0)
   for the first and second state variables respectively. If the
   number of beta values specified is fewer than V, the remainder
   will be set to the last value of beta given (I don't recommend you
   use this, but it's useful if you're changing the number of
   unmeasured state variables and forget to alter the number of
   betas). If the beta value for any state variable is set to zero,
   then that state variable will not contribute anything to the error
   function. You can think of this as saying that the noise on this
   variable is infinite, so you don't care what its value is. I don't
   know why you may want to do this, but the option is there.

TRN:number_of_temporal_pattern_files (+ve integer)
 - If the value specified is P, then the next P lines must be the
   names of the P temporal pattern files to be used for training the
   network. The required format of these files is specified in the
   next section. The files should have the suffix ".tpin1".
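The initialisation controlled by the TRN strings above amounts to
something like the following hedged C sketch. dynet itself draws its
random numbers from the supplied ran1.c generator, not the C library
rand(); the function and variable names here are invented for
illustration only.

  #include <stdlib.h>

  /* Sketch only: fill nwt weights with values drawn uniformly from
   * [-wtrng, +wtrng], reproducibly seeded. dynet itself uses ran1.c
   * rather than the C library rand(). */
  void initweights(double *wt, int nwt, double wtrng, unsigned int seed)
  {
      int i;
      srand(seed);                               /* TRN:random_number_seed  */
      for (i = 0; i < nwt; i++) {
          double u = (double) rand() / RAND_MAX; /* u in [0,1]              */
          wt[i] = wtrng * (2.0*u - 1.0);         /* in [-wtrng, +wtrng]     */
      }
  }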
GRD:number_of_iterations (+ve integer)
 - If using gradient descent, this is the total number of training
   iterations which will be performed.

GRD:learning_rate (+ve real)
 - If using gradient descent, this is the learning rate (eta).

MAC:convergence_tolerance (+ve real)
 - If using macopt, this is the gradient convergence tolerance. In
   other words, once the *square* of the gradient is less than this
   value, training will halt.

MAC:maximum_number_of_iterations (+ve integer)
 - If using macopt, this is the maximum number of training iterations
   which will be performed.

MAC:perform_maccheckgrad?_(yes/no) [no]
 - If using macopt, you can check that the gradient dynet evaluates
   is correct by using a routine in macopt which compares the
   analytic gradient with one calculated using first differences.
   This should only be necessary when debugging, but may be worth
   checking if dynet appears to be going wild.

MAC:maccheckgrad_tolerance (real >= 0) [0.000001]
 - The tolerance at which to check the gradient.

APP:plot_file_name (file name)
 - When dynet is applied to a new set of data, the values of the
   state variables at the last epoch of each temporal pattern file
   are written to this file. See the section below for details of the
   file format. The file should be given the ".dat" extension.

APP:include_v(t=0)_in_plot_file?_(yes/no) [no]
 - Allows you to also have the initial v values written to the ".dat"
   file specified by "APP:plot_file_name".

APP:write_tper_files?_(yes/no) [no]
 - Select whether or not you want an error file for each temporal
   pattern. See the section below for details of the file contents.

APP:number_of_temporal_pattern_files (+ve integer)
 - If the value specified is P, then the next P lines must be the
   names of the P temporal pattern files to which dynet is applied to
   give temporal sequences. The required format of these files is
   specified in the next section. The files should have the suffix
   ".tpin1".

######################################################################
#                            File formats                            #
######################################################################

Input files (user written):
  tp input files:     ".tpin1" or ".tpin2"

Output files (dynet written):
  error files:        ".err"
  weight files:       ".wt"
  tp output files:    ".tpot"
  tp error files:     ".tper"
  final epoch files:  ".dat"

Temporal Pattern input files (.tpin1 .tpin2)
--------------------------------------------

The temporal pattern input files contain the time series (temporal
patterns) which you wish dynet to model. They should be given the
suffix ".tpin1" or ".tpin2" (see below for the distinction). An
example of such a file is as follows:

# dynet tpin file - do not add or remove lines #
################################################
# Vm (meas rec), X (ext input), epochs:
2 2 11
# Data (epoch/recurrent/external):
0    0.00   0.00000  0.00000  -0.74179   0.32059
1    0.49   x        x         0.01926   0.53881
2    0.34   x        x        -0.83042   0.50905
3    0.24   x        x        -0.94573  -0.51070
4    0.27   x        x        -0.68900  -0.14798
5    0.18   x        x         0.88249  -0.59774
6    0.03   x        x        -0.79563   0.91946
7    0.21   x        x         0.37771  -0.45139
8    0.14   x        x         0.32615  -0.88162
9    0.30   x        x         0.16759   0.60141
10   0.49   2.24016 -0.16262   0.85671   0.37204

The header must consist of five lines. The first three are comment
lines. The fourth line has three fields:
  1. number of measured state variables, Vm
  2. number of external inputs, X
  3. number of epochs, N
The fifth line is a comment line.

The next N lines are the data at the N epochs. The first epoch sets
the initial conditions. There are 2+Vm+X columns:
  1. The epoch label, t. This can be any number, e.g. consecutive
     integers to number the lines, or the total elapsed time.
  2. The time step between the current epoch and the previous epoch,
     dt. It follows from this definition that dt at the initial epoch
     is not used by dynet, but it is convenient to set it to zero for
     clarity.
  3. The next Vm columns are the values of the Vm measured state
     variables. When training dynet, these form the target values
     used to define the error which is to be minimized. (The
     exception is the initial values of v, i.e. v(t=0).) When
     applying dynet, we will typically only have v(t=0); if v is
     specified at additional epochs, these values will not be used by
     dynet in the application phase. Whenever a v is not specified,
     an "x" or "X" should be written. If any v at t=0 is not
     specified (i.e. an "x" is written), dynet will set it to the
     default value VINITDEF, which is zero. This is necessary as we
     must always have an initial condition for a dynamical system. I
     RECOMMEND THAT YOU ALWAYS SPECIFY THE INITIAL VALUES OF V. (The
     VINITDEF value will not be subject to any scaling you are using
     in dynet; that is, *within the network* the initial values will
     be set to VINITDEF. If scaling is used, these initial values
     will translate to other values in the output files. While I
     think the code handles all this properly, for your own sanity I
     still strongly recommend that you specify v(t=0), whether or not
     you use scaling. Your specified values will, of course, be
     subject to any scaling you use.) In training dynet we will often
     only specify v at the initial and final epochs. However, if v is
     specified at intermediate epochs, these values will also be used
     to define the minimization error, and thus help to learn the
     weights.
  4. The next X columns are the values of the external inputs. These
     must be defined at every epoch (a future development of dynet is
     intended in which this restriction will be lifted).

Note that the data at a single epoch is *not* an input--target pair.
The whole point of the dynamic model in dynet (which is a discrete
approximation of a first order differential equation) is that the x
and v values at time t-1 produce a v at time t. The target for x(t-1)
and v(t-1) is therefore v(t), i.e. the value on the next line. It
follows that the external inputs at the final epoch are not used by
dynet. However, they should not just be left as blanks: write in some
number, e.g. 0. In a future version I'll allow you to write the
conventional "x" or "X".

The file suffix ".tpin2" is used for files in which the complete
state variable sequence is specified. The suffix ".tpin1" is used
when only the initial and final state variables are specified.

You can, of course, use log values as the inputs and state variables.
However, linear (non-logged) values must be used for the dt steps.
This is on account of the first order Taylor expansion which
evaluates the next state variables on the basis of the output
(derivative) and the previous state variables.
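To make the time stepping concrete, here is a hedged C sketch of a
single forward step, using the transfer functions and bias
conventions described in this manual. The function name, array
layouts and weight orientations are my assumptions for illustration,
not dynet's actual internals.

  #include <math.h>

  /* Hedged sketch of one forward time step: from v(t-1) and x(t-1),
   * compute v(t) in place. Array layouts are assumptions:
   *   v[V]          state variables (updated in place)
   *   x[X]          external inputs
   *   wtVH[V][H]    state-hidden weights
   *   wtXH[X+1][H]  input-hidden weights; row X is the input bias
   *   wtHY[H+1][V]  hidden-output weights; row H is the hidden bias
   *                 (dynet uses the constant HBIAS; taken as 1 here) */
  void forwardstep(int V, int X, int H, double Hlam, double dt,
                   double v[], const double x[],
                   const double wtVH[][H], const double wtXH[][H],
                   const double wtHY[][V])
  {
      double h[H], y[V];
      int k, l, m;

      for (m = 0; m < H; m++) {              /* hidden layer          */
          double S = wtXH[X][m];             /* input bias term       */
          for (k = 0; k < V; k++) S += wtVH[k][m] * v[k];
          for (l = 0; l < X; l++) S += wtXH[l][m] * x[l];
          h[m] = tanh(Hlam * S);             /* H = tanh(Hlam*S)      */
      }
      for (k = 0; k < V; k++) {              /* linear outputs, i.e.
                                                derivative estimates  */
          y[k] = wtHY[H][k];                 /* hidden bias term      */
          for (m = 0; m < H; m++) y[k] += wtHY[m][k] * h[m];
          v[k] += dt * y[k];                 /* first order Taylor step */
      }
  }

Iterating this step from v(t=0) over all of the epochs of a pattern
produces the kind of sequence written to the .tpot files.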
Error files (.err)
------------------

This is a dump of the network error function and the error surface
gradient as a function of iteration number. It is currently only
produced when using macopt for training. The file name is specified
by the "TRN:error_file" string in the specfile. The file has 7
columns:
  1. training iteration number
  2. the likelihood error, lerr
  3. the fractional contribution of lerr to the total error, i.e.
     lerr/toterr
  4. the weight decay (regularization) error, werr
  5. the fractional contribution of werr to the total error, i.e.
     werr/toterr
  6. the total error, toterr = lerr + werr
  7. the gradient, g. g = sqrt(gg), where gg is the squared gradient
     written by macopt (and written to STDOUT when verbosity >= 2).

Note that the errors scale with the total number of targets defined
in the training data. The errors are also in terms of the scaled
variables internal to the program. The gradient has similar
dependencies. The data in this file are really intended as a
qualitative indication of how training proceeds, or for making
comparisons between different network models trained with identical
data sets.

Weight files (.wt)
------------------

The weight file is written by dynet after training, the file name
being specified by the specfile string "TRN:output_weight_file".
Weight files can also be read in by dynet using the string
"NET:input_weight_file". A typical weight file is:

# dynet weights file - do not add or remove lines #
###################################################
# V (tot state), Vm (meas state), X (ext input), H (hidden): (exc biases)
2 2 2 8
# scaling type:
var
# V (state variables) mean and stdev scaling factors:
 6.17275e-01  8.04589e-01
-1.55049e+00  3.46766e+00
# X (external input) mean and stdev scaling factors:
-6.07996e-01  6.13417e-01
 5.71975e-01  6.80077e-01
# Lambda scale parameter for hidden layer:
0.44721
# wtVH (state-hidden weights):
-0.56010 -1.29233  0.45621  0.66428  2.30999  0.48269 -0.15339 -1.21327
-1.02396  1.33431  1.21793  1.34536  2.67156 -1.26362  0.94542  0.34031
# wtXH (input-hidden weights):
-1.99026  1.60542  0.50664  1.57786  4.08565 -0.11660  2.82143 -0.85219
 0.06155  1.16499  2.59686 -0.45088 -1.31578 -2.89263  0.31608 -0.22411
 2.50053  1.29021  1.24943  0.87688 -2.49171 -2.27946  1.72011  1.44805
# wtHY (hidden-output weights):
 2.07704  2.22658  0.56307  0.22885  1.47584 -2.00067 -0.10726  0.80079
 1.28100  3.81347  1.06695  1.72487 -2.92203 -0.39190  2.28566  1.23744
 0.84620 -0.79857

The first three lines are comments. The fourth contains four fields:
  1. total number of state variables, V
  2. number of measured state variables, Vm
  3. number of external inputs, X
  4. number of hidden nodes, H
The fifth line is a comment line. The sixth line specifies the data
scaling type. If this is "var", then the next lines specify the
scaling factors, as shown in the above example. If "var" or "netsize"
scaling is given, the Hlam parameter is also specified. The weights
themselves are specified in three groups:
  1. wtVH (state variable-hidden weights)
  2. wtXH (input-hidden weights)
  3. wtHY (hidden-output weights)
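The sizes of the three weight groups follow from the header fields.
With the biases included automatically, as described earlier, wtVH
should contain V*H values, wtXH (X+1)*H values and wtHY (H+1)*V
values, which matches the example above (16, 24 and 18 values for
V=2, X=2, H=8). A trivial C sketch of this sanity check (names
invented for illustration):

  #include <stdio.h>

  /* Sketch: expected weight counts for a dynet .wt file, inferred
   * from the header fields V, X, H and the bias conventions above. */
  int main(void)
  {
      int V = 2, X = 2, H = 8;        /* from the example header    */
      int nVH = V * H;                /* state-hidden weights       */
      int nXH = (X + 1) * H;          /* input-hidden, incl. bias   */
      int nHY = (H + 1) * V;          /* hidden-output, incl. bias  */
      printf("wtVH: %d  wtXH: %d  wtHY: %d\n", nVH, nXH, nHY);
      return 0;                       /* prints 16, 24, 18          */
  }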
Temporal Pattern output files (.tpot)
-------------------------------------

For each temporal pattern input file to which dynet is applied, a
temporal pattern output file is produced. If the input file name is
TPFILE, the corresponding output file is called TPFILE.tpot. The
exception is when TPFILE has the suffix ".tpin1" or ".tpin2", in
which case this suffix is replaced with the ".tpot" suffix. When it
runs, dynet tells you exactly what the tpot files will be called,
e.g.

  syn.04.200.tpin1 -> syn.04.200.tpot
  myfile           -> myfile.tpot

A typical tpot file is:

# dynet temporal pattern output file #
######################################
# input file = syn.04.600.tpin1
# weights file = syn.04d.wt
# V (tot state), Vm (meas state), epochs:
2 2 11
# State variables (epoch/measured/unmeasured):
0   -0.00000 -0.00000
1    0.46486 -0.19967
2    0.36306 -0.32085
3    0.57170 -0.39786
4    0.98769 -0.36904
5    1.20681 -0.35651
6    1.17218 -0.36505
7    1.32371 -0.34871
8    1.27403 -0.36157
9    1.29278 -0.28753
10   2.10707 -0.46257

The first five lines are comment lines, which tell you what the
corresponding tpin and weight files are. The sixth line has three
fields:
  1. number of state variables, V
  2. number of measured state variables, Vm
  3. number of epochs, N
The seventh line is a comment line. The next N lines are the values
of the V state variables at the N epochs. The first epoch is the
initial conditions. There are V+1 columns: the first is the epoch
label, t, numbering from 0 to N-1 inclusive; the next Vm columns are
the Vm measured state variables; the next V-Vm columns are the
unmeasured state variables.

Temporal Pattern error files (.tper)
------------------------------------

These are very similar to the tpot files, but with the targets and
some additional error information added. The tper files are only
written if "APP:write_tper_files?_(yes/no)" is set to "yes" in the
specfile. See the documentation above on the format of the tpot
files. A typical tper file is:

# dynet temporal pattern error file #
#####################################
# input file = syn.04.600.tpin1
# weights file = syn.04d.wt
# V (tot state), Vm (meas state), epochs:
2 2 11
# Measured (state/target/state-target/|diff/target|):
0   -0.00000 -0.00000  0.00000 0.00  -0.00000 -0.00000  0.00000 0.00
1    0.46486 -------   0.42805 ----  -0.19967 -------  -0.28822 ----
2    0.36306 -------   0.32624 ----  -0.32085 -------  -0.40940 ----
3    0.57170 -------   0.53489 ----  -0.39786 -------  -0.48641 ----
4    0.98769 -------   0.95088 ----  -0.36904 -------  -0.45759 ----
5    1.20681 -------   1.17000 ----  -0.35651 -------  -0.44506 ----
6    1.17218 -------   1.13536 ----  -0.36505 -------  -0.45360 ----
7    1.32371 -------   1.28690 ----  -0.34871 -------  -0.43726 ----
8    1.27403 -------   1.23721 ----  -0.36157 -------  -0.45012 ----
9    1.29278 -------   1.25597 ----  -0.28753 -------  -0.37608 ----
10   2.10707  2.24016 -0.13309 0.06  -0.46257 -0.16262 -0.29995 1.84

The header (the first seven lines) is the same as in the
corresponding tpot file. The last N lines consist of 1+4*Vm columns.
The first column is the epoch label, t. There are then four columns
for each measured state variable:
  1. the state variable at that epoch (as given in the tpot file), v
  2. the corresponding target, t (as given in the tpin file); if no
     target was specified in the tpin file then "-------" will appear
  3. diff = v - t
  4. |diff/t|
If t=0, column 4 will show "Div0", to flag a divide by zero. If
t=diff=0, column 4 will show "0.00".
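The divide-by-zero convention in column 4 can be mimicked as follows.
This is an illustrative C sketch, not dynet's actual code, and the
function name is invented:

  #include <stdio.h>
  #include <math.h>

  /* Sketch: print the diff and |diff/t| columns with the "Div0" and
   * "0.00" conventions described above. */
  void printerrcols(double v, double t)
  {
      double diff = v - t;
      printf("%8.5f ", diff);
      if (t == 0.0)
          printf("%s", (diff == 0.0) ? "0.00" : "Div0");
      else
          printf("%.2f", fabs(diff / t));
      printf("\n");
  }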
If "APP:include_v(t=0)_in_plot_file?_(yes/no)" was set to "yes" in the specfile, then an additional Vm columns will be added on the right which give the initial conditions for each pattern. If initial conditions for any pattern were not specified then the default value, VINITDEF, will be written. ###################################################################### # The synthetic problem # ###################################################################### The directory syn/ contains files for a synthetic problem. It is the same problem as discussed in Bailer-Jones & MacKay 1998. It consists of two external inputs, x1 and x2, and two state variables, v1 and v2. The problem is: dv1/dt = x1 - 2*v1 + 8*v2 - x1*v1 dv2/dt = x2 - 5*v1 + v2 - x2*v2 The autonomous part of this dynamical system (that with the external inputs set to zero) is a decaying harmonic oscillator, with period 1.0 and e^-1 damping timescale 2.0. The files syn.26.000.tpin2 to syn.26.099.tpin2 are 100 instantiations of this dynamical systems (i.e. 100 temporal patterns). In all cases the x input sequences were generated from constrained random walks: x1 (x2) changes with a probability per unit time of 0.65 (0.999) by a random amount uniformly distributed between -0.5 and +0.5 (-1 and +1). The modulus of x is then taken to ensure a positive sequence. The initial v values were randomly selected from a uniform distribution between -1 and +1. The sequences were simulated numerically between t=0 and t=8 inclusive, and sampled with a constant epoch spacing of dt=0.1. Thus the files contain 81 lines. As explained above, the corresponding .tpin1 files (i.e. syn.26.000.tpin1 to syn.26.099.tpin1) are the same files but with the state variable data removed at all but the initial and final state variables removed. The files syn.26.200.tpin1 to syn.26.299.tpin1 are the same temporal patterns but with the v1 state variable removed completely. These can be used to test the performance of dynet with an unmeasured state variable. The specfile provided, syn.26a.spec, is set up to apply dynet to the syn.26.000.tpin1 to syn.26.099.tpin1 files. The weights file, syn/syn.26a.wt, is the result of having trained dynet on the syn.26.000.tpin1 to syn.26.049.tpin1 files: the screen dump from this run can be seen in the file syn/syn.26a.out; the error file is syn/syn.26a.err. Running dynet on the specfile as it stands will produce syn/syn.26a.dat and the .tpot and .tper files for the temporal patterns. I suggest that you play with this synthetic problem to get a feel for dynet and to get familiar with the specfile. ###################################################################### # The dynet program # ###################################################################### Most of what follows will not be required by the user, and has not been written with the user in mind. It is also far from comprehensive. Subroutines ----------- The subroutines within dynet are ordered in the following manner: Principal Control Routines dynettrain, dynetapply Forward Pass Routines dynetloop Gradient Descent Routines graddescent, updatewt Macopt-relevant Routines callmacopt, dymacint, dymacfn Gradient Evaluation Routines ederiv, cumederivs Initialisation Routines dynetinit, dynetloopinit, dysyswtinit Scaling Routines scalecalc, datascale, unscale Input/Output Routines specread, dataread, writeweights, evtpnewname Some subroutines, in particular memory allocation/deallocations, and the transfer and error functions, are in the dynetsubs.c file. 
######################################################################
#                         The dynet program                          #
######################################################################

Most of what follows will not be required by the user, and has not
been written with the user in mind. It is also far from
comprehensive.

Subroutines
-----------

The subroutines within dynet are ordered in the following manner:

  Principal Control Routines    dynettrain, dynetapply
  Forward Pass Routines         dynetloop
  Gradient Descent Routines     graddescent, updatewt
  Macopt-relevant Routines      callmacopt, dymacint, dymacfn
  Gradient Evaluation Routines  ederiv, cumederivs
  Initialisation Routines       dynetinit, dynetloopinit, dysyswtinit
  Scaling Routines              scalecalc, datascale, unscale
  Input/Output Routines         specread, dataread, writeweights,
                                evtpnewname

Some subroutines, in particular the memory allocation/deallocation
routines and the transfer and error functions, are in the dynetsubs.c
file.

scalecalc(): Evaluates the scaling for the data. Other options may be
implemented later, but the current procedure is to scale the data to
have zero mean and unit variance. Although dynet deals with different
patterns which may show behaviour over very different time scales, we
still expect a given input variable to be of the same type for all
patterns. In particular, we expect any input to be of the same scale
for all patterns. Therefore we only need one mean and one variance
parameter for any one input.

scalecalc(): This routine always calculates Hlam based on the size of
the network. Therefore if you choose var or maxmin data scaling, Hlam
will also be set.

datascale(): Scales the data using scaling factors calculated by
scalecalc() or read in from a file. This scales both the external
inputs and the state variables. Note that scaling the latter
automatically scales the outputs (Y). No matter what size the
delta_time terms are, the Y values can accommodate this and keep the
V values in scale. This is because of the linear output transfer
function, which gives the Y values an arbitrarily large dynamic
range. This has implications for weight decay, as we don't want
weight decay to penalize large hidden-output weights just because,
for example, the time_deltas are very small (thus requiring the Ys,
and hence the wtHY weights, to be large).

What I call a single pattern is a single temporal pattern, i.e. a
time sequence of external inputs and corresponding state variables. I
deal with one pattern at a time, i.e. I evaluate the derivatives at
all of the epochs of a given pattern before moving on to the next
pattern. When the weights are updated is decided in graddescent(),
where the three update methods are available. When using macopt, only
the batch update method is implemented.

dysysinit(): Initializes the dynamic system. Initialisation must
occur when a new pattern is presented to the network for training.
All of the _prev variables are initialized to zero. If they need to
be changed, then the default values are set in the header file.
However, this would surely only apply to v_prev and y_prev, as I
cannot see why the initial weight derivatives should be anything but
zero. The effects of the data scaling must be considered when
initialising the system.

Pointer arithmetic is done in callmacopt() and dymacint() to
accommodate the single-offset vectors used by macopt. This
potentially makes the code less robust, as it would screw up if the
types of the weight (wtvec) and error gradient (wt_grad) vectors were
ever changed (from the present double type). However, the change
required would simply be to define wtvec and wt_grad as floats rather
than doubles, which is the kind of change one should make anyway if
the types of a dependent subroutine change.

Bayesian Aspects
----------------

The network incorporates basic Bayesian features via the noise terms
(beta parameters) and weight decay terms (alpha parameters). There is
a separate noise term for each state variable. There are four classes
of weight decay terms:

  alpha[0]  state variable to hidden weights (VH)
  alpha[1]  external input to hidden weights (XH)
  alpha[2]  input bias to hidden weights
  alpha[3]  hidden to output weights (inc. bias) (HY)

Given the argument in the scaling discussion above (see datascale()),
you may not want to use the alpha[3] term: the scale of the delta(t)
terms will influence your choice of alpha[3].
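Putting the alpha and beta terms together, the quantity being
minimized appears to be toterr = lerr + werr, as written to the .err
files. The following hedged C sketch assumes the usual Gaussian
convention (factors of 1/2, with 1/sqrt(beta) and 1/sqrt(alpha) as
standard deviations); dynet's exact normalisation may differ, and the
flattened arrays and names here are illustrative only.

  /* Hedged sketch of the total error described above. Undefined
   * targets contribute nothing, as do state variables with beta = 0.
   * beta_of_tar[i] and alpha_of_wt[i] hold the beta of the state
   * variable behind target i and the alpha of the class of weight i. */
  double toterror(int ntar, const double *v, const double *tar,
                  const int *def, const double *beta_of_tar,
                  int nwt, const double *w, const double *alpha_of_wt)
  {
      double lerr = 0.0, werr = 0.0;
      int i;
      for (i = 0; i < ntar; i++)           /* likelihood (beta) term   */
          if (def[i]) {
              double d = v[i] - tar[i];
              lerr += 0.5 * beta_of_tar[i] * d * d;
          }
      for (i = 0; i < nwt; i++)            /* weight decay (alpha) term */
          werr += 0.5 * alpha_of_wt[i] * w[i] * w[i];
      return lerr + werr;                  /* toterr = lerr + werr     */
  }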
Both the alpha and beta terms are used in the ederiv() subroutine for
calculating the total error derivative. Thus the ed terms returned by
this subroutine contain both the likelihood term (beta) and the prior
term (alpha).

Note that the scale of the gradient which the network evaluates
depends on the scale of the following:
  1. the input data
  2. the weights
  3. the Bayesian parameters alpha and beta
Provided you use some kind of data scaling, the first two are taken
care of. However, the third is not. In particular, using values of
alpha and beta which differ significantly from unity will mean that
the macopt convergence tolerance will have to be changed. Assuming
beta >> alpha, increasing beta by a factor of x will require
MAC:convergence_tolerance to be increased by a factor of x too.

Variable names
--------------

Vsize, Xsize, Hsize and Ysize are the numbers of nodes in the state
variable layer, external input layer, hidden layer and output layer
respectively. These numbers do not include biases. The corresponding
vectors (v,x,h,y) start at 0. The input layer bias is x[Xsize] and is
introduced to the network by adding an extra constant input to the
input vectors. The hidden layer bias is h[Hsize] and is set to HBIAS.

Counting variables with dedicated uses:
  p    - pattern
  t    - epoch
  k    - state variable (V) node
  l    - external input (X) node
  m    - hidden node
  n    - output node
  i,j  - general node values in ederiv()

tar is generally a 3D array of type targets. The variable ordering is
tar[p][t][k] (pattern, epoch, node), with
  1 <= p <= Npats
  0 <= t < ntsteps[p]
  0 <= k < Vsize
tar is a structure with two fields. The first, .def, specifies
whether or not a target is specified. The second, .val, gives the
value. If tar[p][t][k].def is zero, then the target has not been
specified by the user, and you cannot expect tar[p][t][k].val to be
meaningful. If you use scaling, the values stored in the tar array
will be changed. If the initial tar values, tar[p][0][k], are not
defined, we use the default value VINITDEF to initialise the
sequence. Note, however, that this value is not written into
tar[p][0][k], nor is tar[p][t][k].def set to 1.
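In C, the target structure just described might be declared as
follows. Only the .def and .val fields are documented above; the type
name and layout are illustrative, and dynet's actual declaration may
differ.

  /* Sketch of the target structure described above. The array shape
   * tar[p][t][k] is as given in the text. */
  typedef struct {
      int    def;   /* 1 if the user specified this target, else 0 */
      double val;   /* target value (meaningless when def == 0)    */
  } targets;

  /* example access: if pattern p has a target for state variable k
   * at epoch t, accumulate its contribution to the error:
   *     if (tar[p][t][k].def) { ... tar[p][t][k].val ... }          */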