Discrete VRACER

This solver implements a discrete version of VRACER (https://arxiv.org/abs/1807.05827)

Usage

e["Solver"]["Type"] = "Agent/Discrete/DVRACER"

Results

These are the results produced by this solver:

Variable-Specific Settings

These are settings required by this module that are added to each of the experiment’s variables when this module is selected.

Initial Exploration Noise
  • Usage: e["Variables"][index]["Initial Exploration Noise"] = float

  • Description: Initial standard deviation of the Gaussian distribution from which the given action is sampled.

Configuration

These are settings required by this module.

Random Action Probability
  • Usage: e["Solver"]["Random Action Probability"] = float

  • Description: Specifies the probability of taking a random action for the epsilon-greedy strategy.

Mode
  • Usage: e["Solver"]["Mode"] = string

  • Description: Specifies the operation mode for the agent.

  • Options:

    • "Training": Learns a policy for the reinforcement learning problem.

    • "Testing": Tests a previously learned policy.

Testing / Sample Ids
  • Usage: e["Solver"]["Testing"]["Sample Ids"] = List of unsigned integer

  • Description: A list with the identifiers of the samples used to test the policy hyperparameters.

Testing / Policy
  • Usage: e["Solver"]["Testing"]["Policy"] = knlohmann::json

  • Description: The hyperparameters of the policy to test.
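
For example, a testing run could be configured as in the sketch below; trainedPolicyHyperparameters is a placeholder for policy hyperparameters saved by an earlier training run and is not provided by this solver:

  # Sketch: evaluate a previously learned policy on a few sample ids
  e["Solver"]["Mode"] = "Testing"
  e["Solver"]["Testing"]["Sample Ids"] = [0, 1, 2, 3]
  # Placeholder: hyperparameters obtained from a previous training run
  e["Solver"]["Testing"]["Policy"] = trainedPolicyHyperparameters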

Training / Average Depth
  • Usage: e["Solver"]["Training"]["Average Depth"] = unsigned integer

  • Description: Specifies the depth of the running training average to report.

Concurrent Environments
  • Usage: e["Solver"]["Concurrent Environments"] = unsigned integer

  • Description: Indicates the number of concurrent environments used to collect experiences.

Episodes Per Generation
  • Usage: e["Solver"]["Episodes Per Generation"] = unsigned integer

  • Description: Indicates how many episodes to complete in a generation (checkpoints are generated between generations).

Mini Batch / Size
  • Usage: e["Solver"]["Mini Batch"]["Size"] = unsigned integer

  • Description: The number of experiences to select at random for training the neural network(s).

Mini Batch / Strategy
  • Usage: e["Solver"]["Mini Batch"]["Strategy"] = string

  • Description: Determines how to select experiences from the replay memory for mini batch creation.

  • Options:

    • "Uniform": Selects experiences from the replay memory with a random uniform probability distribution.

Time Sequence Length
  • Usage: e["Solver"]["Time Sequence Length"] = unsigned integer

  • Description: Indicates the number of contiguous experiences to pass to the NN for learning. This is only useful when using recurrent NNs.
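
A brief sketch combining the mini-batch and time-sequence settings above (the values shown equal the defaults listed at the end of this page):

  # Mini-batch sampling and time-sequence length (defaults shown)
  e["Solver"]["Mini Batch"]["Size"] = 256
  e["Solver"]["Mini Batch"]["Strategy"] = "Uniform"
  e["Solver"]["Time Sequence Length"] = 1  # values > 1 are useful only with recurrent NNs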

Learning Rate
  • Usage: e["Solver"]["Learning Rate"] = float

  • Description: The initial learning rate to use for the NN hyperparameter optimization.

L2 Regularization / Enabled
  • Usage: e["Solver"]["L2 Regularization"]["Enabled"] = True/False

  • Description: Boolean to determine if L2 regularization will be applied to the neural networks.

L2 Regularization / Importance
  • Usage: e["Solver"]["L2 Regularization"]["Importance"] = float

  • Description: Coefficient for L2 regularization.

Neural Network / Hidden Layers
  • Usage: e["Solver"]["Neural Network"]["Hidden Layers"] = knlohmann::json

  • Description: Indicates the configuration of the hidden neural network layers.

Neural Network / Optimizer
  • Usage: e["Solver"]["Neural Network"]["Optimizer"] = string

  • Description: Indicates the optimizer algorithm to update the NN hyperparameters.

Neural Network / Engine
  • Usage: e["Solver"]["Neural Network"]["Engine"] = string

  • Description: Specifies which Neural Network backend to use.
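
The sketch below shows one possible Neural Network configuration. The layer types, activation function, optimizer, and engine names are assumptions about the options offered by the corresponding neural network modules, and the channel counts and learning rate are only illustrative:

  # Illustrative hidden-layer stack: two linear layers with a nonlinearity
  # in between (layer/function/optimizer/engine names are assumptions)
  e["Solver"]["Neural Network"]["Hidden Layers"][0]["Type"] = "Layer/Linear"
  e["Solver"]["Neural Network"]["Hidden Layers"][0]["Output Channels"] = 128
  e["Solver"]["Neural Network"]["Hidden Layers"][1]["Type"] = "Layer/Activation"
  e["Solver"]["Neural Network"]["Hidden Layers"][1]["Function"] = "Elementwise/Tanh"
  e["Solver"]["Neural Network"]["Hidden Layers"][2]["Type"] = "Layer/Linear"
  e["Solver"]["Neural Network"]["Hidden Layers"][2]["Output Channels"] = 128

  e["Solver"]["Neural Network"]["Optimizer"] = "Adam"
  e["Solver"]["Neural Network"]["Engine"] = "OneDNN"
  e["Solver"]["Learning Rate"] = 1e-4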

Discount Factor
  • Usage: e["Solver"]["Discount Factor"] = float

  • Description: Represents the discount factor to weight future experiences.

Experience Replay / Serialize
  • Usage: e["Solver"]["Experience Replay"]["Serialize"] = True/False

  • Description: Indicates whether to serialize and store the experience replay after each generation. Disabling will reduce I/O overheads but will disable the checkpoint/resume function.

Experience Replay / Start Size
  • Usage: e["Solver"]["Experience Replay"]["Start Size"] = unsigned integer

  • Description: The minimum number of experiences before learning starts.

Experience Replay / Maximum Size
  • Usage: e["Solver"]["Experience Replay"]["Maximum Size"] = unsigned integer

  • Description: The size of the replay memory. If this number is exceeded, experiences are deleted.

Experience Replay / Off Policy / Cutoff Scale
  • Usage: e["Solver"]["Experience Replay"]["Off Policy"]["Cutoff Scale"] = float

  • Description: Initial Cut-Off to classify experiences as on- or off-policy. (c_max in https://arxiv.org/abs/1807.05827)

Experience Replay / Off Policy / Target
  • Usage: e["Solver"]["Experience Replay"]["Off Policy"]["Target"] = float

  • Description: Target fraction of off-policy experiences in the replay memory. (D in https://arxiv.org/abs/1807.05827)

Experience Replay / Off Policy / Annealing Rate
  • Usage: e["Solver"]["Experience Replay"]["Off Policy"]["Annealing Rate"] = float

  • Description: Annealing rate for Off Policy Cutoff Scale and Learning Rate. (A in https://arxiv.org/abs/1807.05827)

Experience Replay / Off Policy / REFER Beta
  • Usage: e["Solver"]["Experience Replay"]["Off Policy"]["REFER Beta"] = float

  • Description: Initial value for the penalization coefficient for off-policiness. (beta in https://arxiv.org/abs/1807.05827)
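
Putting the experience replay settings together, a configuration might look like the sketch below. The off-policy values equal the defaults listed at the end of this page, while the start and maximum sizes are illustrative choices:

  # Replay memory sizing and serialization
  e["Solver"]["Experience Replay"]["Start Size"] = 1024
  e["Solver"]["Experience Replay"]["Maximum Size"] = 65536
  e["Solver"]["Experience Replay"]["Serialize"] = True

  # REFER off-policy control (defaults shown)
  e["Solver"]["Experience Replay"]["Off Policy"]["Cutoff Scale"] = 4.0
  e["Solver"]["Experience Replay"]["Off Policy"]["Target"] = 0.1
  e["Solver"]["Experience Replay"]["Off Policy"]["Annealing Rate"] = 0.0
  e["Solver"]["Experience Replay"]["Off Policy"]["REFER Beta"] = 0.3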

Experiences Between Policy Updates
  • Usage: e["Solver"]["Experiences Between Policy Updates"] = float

  • Description: The number of experiences to receive before training/updating. This may be a real number smaller than 1.0, in which case more than one update is performed per experience (e.g., a value of 0.5 results in two updates per experience).

State Rescaling / Enabled
  • Usage: e["Solver"]["State Rescaling"]["Enabled"] = True/False

  • Description: Determines whether to normalize the states, such that they have mean 0 and standard deviation 1 (done only once after the initial exploration phase).

Reward / Rescaling / Enabled
  • Usage: e["Solver"]["Reward"]["Rescaling"]["Enabled"] = True/False

  • Description: Determines whether to normalize the rewards, such that they have mean 0 and standard deviation 1.

Reward / Rescaling / Frequency
  • Usage: e["Solver"]["Reward"]["Rescaling"]["Frequency"] = unsigned integer

  • Description: The number of policy updates between consecutive reward rescalings.

Reward / Outbound Penalization / Enabled
  • Usage: e["Solver"]["Reward"]["Outbound Penalization"]["Enabled"] = True/False

  • Description: If enabled, penalizes the rewards of experiences with out-of-bounds actions. This is useful for problems with truncated actions (e.g., OpenAI Gym Mujoco), where out-of-bounds actions are clipped in the environment, and prevents the policy mean from extending too far outside the bounds.

Reward / Outbound Penalization / Factor
  • Usage: e["Solver"]["Reward"]["Outbound Penalization"]["Factor"] = float

  • Description: The factor (f) by which the reward is scaled down: R = f * R.

Termination Criteria

These are the customizable criteria that indicate whether the solver should continue or finish execution. Korali will stop when at least one of these conditions is met. Each criterion is expressed in C++, since it is compiled and evaluated by the engine as shown here.

Max Episodes
  • Usage: e["Solver"]["Termination Criteria"]["Max Episodes"] = unsigned integer

  • Description: The solver will stop when the given number of episodes has been run.

  • Criteria: (_mode == "Training") && (_maxEpisodes > 0) && (_currentEpisode >= _maxEpisodes)

Max Experiences
  • Usage: e["Solver"]["Termination Criteria"]["Max Experiences"] = unsigned integer

  • Description: The solver will stop when the given number of experiences has been gathered.

  • Criteria: (_mode == "Training") && (_maxExperiences > 0) && (_experienceCount >= _maxExperiences)

Testing / Target Average Reward
  • Usage: e["Solver"]["Termination Criteria"]["Testing"]["Target Average Reward"] = float

  • Description: The solver will stop when the best average per-episode reward, computed over the experiences between two learner updates, reaches the given value.

  • Criteria: (_mode == "Training") && (_testingTargetAverageReward > -korali::Inf) && (_testingBestAverageReward >= _testingTargetAverageReward)

Testing / Average Reward Increment
  • Usage: e["Solver"]["Termination Criteria"]["Testing"]["Average Reward Increment"] = float

  • Description: The solver will stop when the average testing reward is below the previous testing average by more than a threshold given by this factor multiplied with the testing standard deviation.

  • Criteria: (_mode == "Training") && (_testingAverageRewardIncrement > 0.0) && (_testingPreviousAverageReward > -korali::Inf) && (_testingAverageReward + _testingStdevReward * _testingAverageRewardIncrement < _testingPreviousAverageReward)

Max Policy Updates
  • Usage: e["Solver"]["Termination Criteria"]["Max Policy Updates"] = unsigned integer

  • Description: The solver will stop when the given number of optimization steps has been performed.

  • Criteria: (_mode == "Training") && (_maxPolicyUpdates > 0) && (_policyUpdateCount >= _maxPolicyUpdates)

Max Model Evaluations
  • Usage: e["Solver"]["Termination Criteria"]["Max Model Evaluations"] = unsigned integer

  • Description: Specifies the maximum allowed evaluations of the computational model.

  • Criteria: _maxModelEvaluations <= _modelEvaluationCount

Max Generations
  • Usage: e["Solver"]["Termination Criteria"]["Max Generations"] = unsigned integer

  • Description: Determines how many solver generations to run before stopping execution. Execution can be resumed at a later moment.

  • Criteria: _k->_currentGeneration > _maxGenerations
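
For instance, a training run could be bounded as sketched below, using the "Termination Criteria" nesting that also appears in the Default Configuration block that follows (all values are illustrative):

  # Stop after a fixed budget of episodes or experiences, or earlier once a
  # target average testing reward has been reached (illustrative values)
  e["Solver"]["Termination Criteria"]["Max Episodes"] = 10000
  e["Solver"]["Termination Criteria"]["Max Experiences"] = 1000000
  e["Solver"]["Termination Criteria"]["Testing"]["Target Average Reward"] = 450.0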

Default Configuration

The following configuration will be assigned by default. Any settings defined by the user will override these defaults.

{
"Concurrent Environments": 1,
"Discount Factor": 0.995,
"Episodes Per Generation": 1,
"Experience Replay": {
    "Off Policy": {
        "Annealing Rate": 0.0,
        "Cutoff Scale": 4.0,
        "REFER Beta": 0.3,
        "Target": 0.1
        },
    "Serialize": true
    },
"L2 Regularization": {
    "Enabled": false,
    "Importance": 0.0001
    },
"Mini Batch": {
    "Size": 256,
    "Strategy": "Uniform"
    },
"Model Evaluation Count": 0,
"Random Action Probability": 0.05,
"Reward": {
    "Outbound Penalization": {
        "Enabled": false,
        "Factor": 0.5
        },
    "Rescaling": {
        "Enabled": false,
        "Frequency": 1000
        }
    },
"State Rescaling": {
    "Enabled": false
    },
"Termination Criteria": {
    "Max Episodes": 0,
    "Max Experiences": 0,
    "Max Generations": 10000000000,
    "Max Model Evaluations": 1000000000,
    "Max Policy Updates": 0,
    "Testing": {
        "Average Reward Increment": 0.0,
        "Target Average Reward": -Infinity
        }
    },
"Testing": {
    "Policy": {    },
    "Sample Ids": []
    },
"Time Sequence Length": 1,
"Training": {
    "Average Depth": 100
    },
"Uniform Generator": {
    "Maximum": 1.0,
    "Minimum": 0.0,
    "Type": "Univariate/Uniform"
    },
"Variable Count": 0
}

Variable Defaults

The following configuration will be assigned to each of the experiment's variables by default. Any settings defined by the user will override these defaults.

{
"Initial Exploration Noise": -1.0
}