# VBAF.RL.PPO.ps1

#Requires -Version 5.1
<#
.SYNOPSIS
    Proximal Policy Optimization (PPO) Agent for Reinforcement Learning
.DESCRIPTION
    Implements the PPO algorithm -- one of the most widely used RL
    algorithms in research and industry today.

    WHAT YOU ARE LEARNING HERE:
    ============================
    PPO is a POLICY GRADIENT method -- a fundamentally different approach
    to RL than Q-learning or DQN.

    Q-LEARNING (what you learned before):
      Learns a VALUE FUNCTION: Q(state, action) = expected future reward
      Chooses the action with the highest Q-value
      Indirect: learn values, then derive policy from values

    PPO (what you are learning now):
      Learns a POLICY DIRECTLY: pi(action | state) = probability of each action
      The policy IS the output -- no value table or Q-table needed
      Direct: learn "what to do" rather than "how good is each option"

    TWO NETWORKS -- ACTOR AND CRITIC:
    ==================================
    PPO uses TWO neural networks working together:

    ACTOR (the policy network):
      Input: state (4 numbers for CartPole)
      Output: one score (logit) per action; softmax turns these into
              probabilities, e.g. [0.3, 0.7] = 30% left, 70% right
      Role: decides WHAT to do
      Learns: to increase probability of actions that led to high reward

    CRITIC (the value network):
      Input: state (4 numbers for CartPole)
      Output: one number -- estimated total future reward from this state
      Role: evaluates HOW GOOD the current situation is
      Learns: to predict total reward accurately

    The Critic helps train the Actor by providing a BASELINE.
    Without a baseline, reward signals are noisy and hard to learn from.
    With a baseline: "this action was better than average" vs "worse than average."

    GENERALIZED ADVANTAGE ESTIMATION (GAE):
    =========================================
    Advantage = "how much better was this action than expected"
    Advantage = actual_return - critic_estimate

    Positive advantage: action led to MORE reward than expected -- do it more
    Negative advantage: action led to LESS reward than expected -- do it less

    GAE smooths the advantage estimate across multiple time steps
    using the lambda parameter (LambdaGAE = 0.95).
    This reduces variance (noisy estimates) at the cost of some bias.

    THE PPO "CLIP" TRICK:
    =====================
    The key innovation in PPO is the CLIPPED UPDATE.

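    Older policy gradient methods (such as vanilla REINFORCE) could make huge
    policy updates that destabilised training -- like overcorrecting a steering wheel.
    TRPO prevented this with a KL-divergence constraint, but it is complex to implement.

    PPO gets a similar effect more simply by CLIPPING the update ratio: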
      ratio = new_probability / old_probability
      clipped_ratio = clip(ratio, 1-epsilon, 1+epsilon)

    ClipEpsilon = 0.2 means:
      If the new policy pushes an action probability more than 20%
      above or below the old probability, the update is clipped.
      This keeps policy changes small and stable each step.
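
    WORKED EXAMPLE (illustrative numbers, not from a real run):
      old_probability = 0.50, new_probability = 0.65, epsilon = 0.2
      ratio         = 0.65 / 0.50            = 1.30
      clipped_ratio = clip(1.30, 0.80, 1.20) = 1.20
      The policy moved 30% above the old probability, so the ratio used
      for a positive advantage is capped at 1.20.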

    ROLLOUT BUFFER:
    ===============
    Unlike DQN (which trains on random samples from a replay buffer),
    PPO collects a ROLLOUT -- a fixed-size sequence of recent experiences.
    After RolloutSteps experiences, it trains on ALL of them (UpdateEpochs times),
    then discards them and starts a new rollout.
    This is "on-policy" learning -- the agent learns from its own current behaviour.

    PPO vs DQN -- WHEN TO USE WHICH:
    ==================================
    DQN: simpler, works well for discrete actions, sample efficient
          uses experience replay (off-policy)
    PPO: more stable, handles continuous and discrete actions
          widely used in robotics and games (e.g. OpenAI Five)
          on-policy (only learns from current policy's experience)

    THEORY REFERENCE:
    =================
    Schulman, J. et al. (2017). "Proximal Policy Optimization Algorithms."
    arXiv:1707.06347. OpenAI.

    PPO replaced TRPO as the default policy gradient algorithm at OpenAI.
    It is simpler to implement and tune than TRPO and performs comparably
    or better in practice.

.NOTES
    Part of VBAF (Visual AI & Reinforcement Learning Framework)
    Educational use -- compare with DQN to understand policy vs value methods.
    Requires VBAF.Core.AllClasses.ps1 (loaded via VBAF.LoadAll.ps1)
#>


$basePath = $PSScriptRoot


# ============================================================
# PPOCONFIG -- hyperparameters
# ============================================================
#
# KEY PPO HYPERPARAMETERS:
#
# Gamma = 0.99 -- higher than DQN's 0.95
# PPO often uses higher gamma because it learns a value function
# (Critic) that can accurately estimate long-term returns.
# Higher gamma = agent cares more about distant future rewards.
#
# LambdaGAE = 0.95 -- GAE smoothing factor
# Controls the tradeoff between variance and bias in advantage estimates.
# LambdaGAE = 1.0: unbiased but high variance (pure Monte Carlo)
# LambdaGAE = 0.0: low variance but biased (pure TD error)
# LambdaGAE = 0.95: standard tradeoff used in the original PPO paper
#
# ClipEpsilon = 0.2 -- the PPO clip range
# Limits how much the policy can change in one update step.
# Smaller = more conservative updates (slower but more stable)
# Larger = faster updates (but risks instability)
# 0.2 is the value used in the original PPO paper.
#
# EntropyBonus = 0.01 -- encourages exploration
# Adds a bonus for maintaining a diverse (high entropy) policy.
# Without this, the policy might collapse to always choosing one action.
# "Entropy" here means how spread out the action probabilities are.
#
# UpdateEpochs = 4 -- training passes per rollout
# After collecting RolloutSteps experiences, train on them 4 times.
# More epochs = better use of data but risks overfitting to the rollout.
#
# RolloutSteps = 64 -- collect this many steps before each update
# Longer rollouts = more stable advantage estimates but slower updates.

class PPOConfig {
    [int]    $StateSize     = 4
    [int]    $ActionSize    = 2
    [int[]]  $ActorHidden   = @(64, 64)   # Actor network architecture
    [int[]]  $CriticHidden  = @(64, 64)   # Critic network architecture
    [double] $LearningRate  = 0.001
    [double] $Gamma         = 0.99        # Higher than DQN -- long-horizon planning
    [double] $LambdaGAE     = 0.95        # GAE smoothing (1.0=unbiased, 0.0=TD)
    [double] $ClipEpsilon   = 0.2         # PPO clip range -- limits policy change
    [double] $EntropyBonus  = 0.01        # Exploration bonus
    [int]    $UpdateEpochs  = 4           # Training passes per rollout
    [int]    $RolloutSteps  = 64          # Steps before each PPO update
    [int]    $MaxSteps      = 200         # Max steps per episode
}


# ============================================================
# PPOAGENT -- the learning agent
# ============================================================
#
# ACTOR-CRITIC ARCHITECTURE:
# --------------------------
# Actor: [StateSize -> hidden -> ActionSize] outputs action probabilities
# Critic: [StateSize -> hidden -> 1] outputs state value estimate
#
# Both networks share the same state input but have different outputs
# and different loss functions.
#
# DEPENDENCY INJECTION (PS 5.1 PATTERN):
# ----------------------------------------
# Same pattern as DQNAgent -- [object] type used for cross-file classes.
# Both Actor and Critic are NeuralNetwork instances built outside this class
# and passed in via the constructor.

class PPOAgent {
    [object] $Actor    # Policy network: state -> action probabilities
    [object] $Critic   # Value network: state -> expected return
    [object] $Config

    [int]    $TotalSteps     = 0
    [int]    $TotalEpisodes  = 0
    [int]    $UpdateCount    = 0
    [double] $LastActorLoss  = 0.0
    [double] $LastCriticLoss = 0.0
    [double] $LastEntropy    = 0.0   # Higher entropy = more exploration

    [System.Collections.Generic.List[double]] $EpisodeRewards
    [System.Collections.Generic.List[double]] $ActorLossHistory
    [System.Collections.Generic.List[double]] $CriticLossHistory

    # Rollout buffer -- stores RolloutSteps transitions before each update
    hidden [System.Collections.ArrayList] $States
    hidden [System.Collections.ArrayList] $Actions
    hidden [System.Collections.ArrayList] $Rewards
    hidden [System.Collections.ArrayList] $Values    # Critic estimates at each step
    hidden [System.Collections.ArrayList] $LogProbs  # Log probabilities of actions taken
    hidden [System.Collections.ArrayList] $Dones

    hidden [System.Random] $Rng

    PPOAgent([object]$config, [object]$actor, [object]$critic) {
        $this.Config  = $config
        $this.Actor   = $actor
        $this.Critic  = $critic
        $this.Rng     = [System.Random]::new()

        $this.EpisodeRewards    = [System.Collections.Generic.List[double]]::new()
        $this.ActorLossHistory  = [System.Collections.Generic.List[double]]::new()
        $this.CriticLossHistory = [System.Collections.Generic.List[double]]::new()

        $this.ClearRollout()

        Write-Host " PPOAgent created" -ForegroundColor Green
        Write-Host " State size : $($config.StateSize)"                  -ForegroundColor Cyan
        Write-Host " Action size : $($config.ActionSize)"                 -ForegroundColor Cyan
        Write-Host " Actor hidden : $($config.ActorHidden -join ' -> ')" -ForegroundColor Cyan
        Write-Host " Critic hidden : $($config.CriticHidden -join ' -> ')" -ForegroundColor Cyan
        Write-Host " Clip epsilon : $($config.ClipEpsilon)"               -ForegroundColor Cyan
        Write-Host " Rollout steps : $($config.RolloutSteps)"              -ForegroundColor Cyan
    }

    # SOFTMAX: converts raw network outputs (logits) to probabilities.
    # Subtracts the maximum before exp() to prevent numerical overflow.
    # Output: array of values summing to 1.0 -- a proper probability distribution.
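    # Worked example (illustrative values, rounded): logits = [1.0, 2.0]
    #   max = 2.0, shifted = [-1.0, 0.0], exp = [0.3679, 1.0000], sum = 1.3679
    #   probs = [0.2689, 0.7311]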
    hidden [double[]] Softmax([double[]]$logits) {
        $max  = ($logits | Measure-Object -Maximum).Maximum
        $exps = @(0.0) * $logits.Length
        $sum  = 0.0
        for ($i = 0; $i -lt $logits.Length; $i++) {
            $exps[$i]  = [Math]::Exp($logits[$i] - $max)
            $sum      += $exps[$i]
        }
        $probs = @(0.0) * $logits.Length
        for ($i = 0; $i -lt $logits.Length; $i++) {
            $probs[$i] = $exps[$i] / $sum
        }
        return $probs
    }

    # SAMPLE ACTION: draw one action from the probability distribution.
    # If probs = [0.3, 0.7]: action 0 chosen 30% of the time, action 1 70%.
    # This is STOCHASTIC sampling -- unlike DQN's deterministic argmax.
    # Stochastic policy naturally explores without needing epsilon-greedy.
    hidden [int] SampleAction([double[]]$probs) {
        $r   = $this.Rng.NextDouble()
        $cum = 0.0
        for ($i = 0; $i -lt $probs.Length; $i++) {
            $cum += $probs[$i]
            if ($r -le $cum) { return $i }
        }
        return $probs.Length - 1
    }

    # LOG PROBABILITY: log(P(action)) -- used in the PPO ratio calculation.
    # We use log probabilities instead of raw probabilities because:
    # 1. log(a/b) = log(a) - log(b) -- subtraction is numerically stable
    # 2. Products of small probabilities underflow -- logs prevent this
    # Clamp to 1e-8 to avoid log(0) = -infinity.
    hidden [double] LogProb([double[]]$probs, [int]$action) {
        $p = [Math]::Max($probs[$action], 1e-8)
        return [Math]::Log($p)
    }

    # ENTROPY: measures how spread out the probability distribution is.
    # H = -sum(p * log(p))
    # High entropy: probabilities near equal (0.5, 0.5) -- uncertain, exploring
    # Low entropy: probabilities extreme (0.0, 1.0) -- confident, exploiting
    # EntropyBonus encourages the agent to maintain some exploration
    # throughout training -- prevents premature convergence to suboptimal policy.
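    # Worked examples with two actions (natural log, values rounded):
    #   probs = [0.5, 0.5] -> H = ln(2) = 0.6931 (the maximum for two actions)
    #   probs = [0.9, 0.1] -> H = 0.0948 + 0.2303 = 0.3251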
    hidden [double] Entropy([double[]]$probs) {
        $h = 0.0
        foreach ($p in $probs) {
            if ($p -gt 1e-8) { $h -= $p * [Math]::Log($p) }
        }
        return $h
    }

    # ACT: the full Actor-Critic forward pass for one step.
    # Returns Action, LogProb (for PPO ratio), and Value (for GAE).
    # All three are needed by StoreTransition and Update.
    [hashtable] Act([double[]]$state) {
        $logits = $this.Actor.Predict($state)
        $probs  = $this.Softmax($logits)
        $action = $this.SampleAction($probs)   # Stochastic -- natural exploration
        $logP   = $this.LogProb($probs, $action)

        $valueOut = $this.Critic.Predict($state)
        $value    = $valueOut[0]   # Critic outputs one number

        return @{ Action = $action; LogProb = $logP; Value = $value; Probs = $probs }
    }

    # PREDICT: greedy action for evaluation (no sampling -- pick highest probability).
    # Used during benchmarking -- we want the agent's BEST action, not a random sample.
    [int] Predict([double[]]$state) {
        $logits = $this.Actor.Predict($state)
        $probs  = $this.Softmax($logits)
        $best   = 0
        for ($i = 1; $i -lt $probs.Length; $i++) {
            if ($probs[$i] -gt $probs[$best]) { $best = $i }
        }
        return $best
    }

    # Store one transition in the rollout buffer.
    # All six values are needed for the PPO update:
    # State, Action -- what happened
    # Reward -- what we got
    # Value -- what the Critic predicted we would get
    # LogProb -- log probability of this action under old policy
    # Done -- did the episode end
    [void] StoreTransition([double[]]$state, [int]$action, [double]$reward,
                           [double]$value, [double]$logProb, [bool]$done) {
        $this.States.Add($state)
        $this.Actions.Add($action)
        $this.Rewards.Add($reward)
        $this.Values.Add($value)
        $this.LogProbs.Add($logProb)
        $this.Dones.Add($done)
        $this.TotalSteps++
    }

    [void] ClearRollout() {
        $this.States   = [System.Collections.ArrayList]::new()
        $this.Actions  = [System.Collections.ArrayList]::new()
        $this.Rewards  = [System.Collections.ArrayList]::new()
        $this.Values   = [System.Collections.ArrayList]::new()
        $this.LogProbs = [System.Collections.ArrayList]::new()
        $this.Dones    = [System.Collections.ArrayList]::new()
    }

    # COMPUTE GAE -- Generalized Advantage Estimation.
    #
    # For each time step t in the rollout, compute:
    # delta_t = r_t + gamma * V(s_{t+1}) - V(s_t) (TD error)
    # A_t = delta_t + gamma * lambda * A_{t+1} (GAE recursion)
    #
    # This is computed BACKWARDS through the rollout (t = n-1 down to 0).
    # At episode boundaries (done=true), future advantages are zeroed out.
    #
    # Also computes RETURNS = advantage + value, taken BEFORE normalisation.
    # These returns are the Critic's training targets.
    #
    # ADVANTAGE NORMALISATION:
    # Subtract mean and divide by std dev.
    # This keeps advantage values in a consistent range across rollouts,
    # making learning rate and other hyperparameters easier to tune.
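    #
    # WORKED EXAMPLE (illustrative numbers; gamma = 0.99, lambda = 0.95):
    # Two-step rollout, rewards = [1, 1], values = [0.5, 0.5], done at t = 1.
    # t = 1: nextVal = 0 (done), delta = 1 + 0 - 0.5 = 0.5, A_1 = 0.5, return_1 = 1.0
    # t = 0: nextVal = 0.5, delta = 1 + 0.99 * 0.5 - 0.5 = 0.995
    #        A_0 = 0.995 + 0.99 * 0.95 * 0.5 = 1.46525, return_0 = 1.96525
    # After normalisation the advantages become approximately [+1, -1].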
    hidden [hashtable] ComputeGAE([double]$lastValue) {
        $n          = $this.Rewards.Count
        $advantages = @(0.0) * $n
        $returns    = @(0.0) * $n
        $gaeVal     = 0.0

        for ($t = $n - 1; $t -ge 0; $t--) {
            $done    = [bool]$this.Dones[$t]
            $reward  = [double]$this.Rewards[$t]
            $value   = [double]$this.Values[$t]
            $nextVal = if ($t -eq $n - 1) { $lastValue } else { [double]$this.Values[$t + 1] }

            if ($done) { $nextVal = 0.0; $gaeVal = 0.0 }  # No future at episode end

            $delta          = $reward + $this.Config.Gamma * $nextVal - $value
            $gaeVal         = $delta + $this.Config.Gamma * $this.Config.LambdaGAE * $gaeVal
            $advantages[$t] = $gaeVal
            $returns[$t]    = $gaeVal + $value
        }

        # Normalise advantages: subtract mean, divide by std dev
        $mean   = ($advantages | Measure-Object -Average).Average
        $sq     = $advantages | ForEach-Object { ($_ - $mean) * ($_ - $mean) }
        $stdDev = [Math]::Sqrt(($sq | Measure-Object -Average).Average + 1e-8)
        for ($i = 0; $i -lt $n; $i++) {
            $advantages[$i] = ($advantages[$i] - $mean) / $stdDev
        }

        return @{ Advantages = $advantages; Returns = $returns }
    }

    # THE PPO UPDATE -- train Actor and Critic on the collected rollout.
    #
    # For each transition in the rollout (repeated UpdateEpochs times):
    #
    # CRITIC UPDATE:
    # Target = discounted return (computed by GAE)
    # Train Critic to predict this return accurately
    # Loss = (predicted_value - return)^2
    #
    # ACTOR UPDATE (the PPO innovation):
    # ratio = exp(new_log_prob - old_log_prob) = new_prob / old_prob
    # clipped_ratio = clip(ratio, 1-epsilon, 1+epsilon)
    # Objective = min(ratio * advantage, clipped_ratio * advantage)
    # Positive advantage: no extra credit once ratio exceeds 1+epsilon
    # Negative advantage: no extra credit once ratio falls below 1-epsilon
    # Train Actor to maximise this clipped objective
    #
    # ENTROPY BONUS:
    # Add entropy * EntropyBonus to encourage exploration
    # Prevents policy collapsing to deterministic too quickly
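    #
    # WORKED EXAMPLE of the clipped objective (illustrative numbers, epsilon = 0.2),
    # using the standard form min(ratio * A, clipped_ratio * A):
    # ratio = 1.30, A = +2: min(2.60, 2.40) = 2.40 (clipped -- no extra credit)
    # ratio = 1.30, A = -2: min(-2.60, -2.40) = -2.60 (unclipped -- full penalty)
    # ratio = 0.70, A = -2: min(-1.40, -1.60) = -1.60 (clipped at 1 - epsilon)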
    [void] Update([double]$lastValue) {
        $gae        = $this.ComputeGAE($lastValue)
        $advantages = $gae.Advantages
        $returns    = $gae.Returns
        $n          = $this.States.Count

        $totalActorLoss  = 0.0
        $totalCriticLoss = 0.0
        $totalEntropy    = 0.0
        $updateSamples   = 0

        for ($epoch = 0; $epoch -lt $this.Config.UpdateEpochs; $epoch++) {
            for ($t = 0; $t -lt $n; $t++) {
                $state      = [double[]]$this.States[$t]
                $action     = [int]$this.Actions[$t]
                $oldLogProb = [double]$this.LogProbs[$t]
                $advantage  = $advantages[$t]
                $ret        = $returns[$t]

                # Critic update: learn to predict discounted returns
                $criticTarget    = @($ret)
                $criticLoss      = $this.Critic.TrainSample($state, $criticTarget)
                $totalCriticLoss += $criticLoss

                # Actor update: PPO clipped objective
                $logits   = $this.Actor.Predict($state)
                $probs    = $this.Softmax($logits)
                $newLogP  = $this.LogProb($probs, $action)
                $entropy  = $this.Entropy($probs)
                $totalEntropy += $entropy

                # PPO ratio: how much did the policy change for this action
                $ratio     = [Math]::Exp($newLogP - $oldLogProb)
                $clipRatio = [Math]::Max($this.Config.ClipEpsilon * -1,
                             [Math]::Min($this.Config.ClipEpsilon,
                             $ratio - 1.0)) + 1.0

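                # Nudge action probability in direction of advantage, clipped.
                # PPO takes min(ratio * A, clipRatio * A); for a negative advantage
                # that selects the LARGER of the two ratios.
                $effectiveRatio         = if ($advantage -ge 0) { [Math]::Min($ratio, $clipRatio) } else { [Math]::Max($ratio, $clipRatio) }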
                $actorTarget            = $probs.Clone()
                $nudge                  = $advantage * $effectiveRatio * 0.1 + $this.Config.EntropyBonus * $entropy
                $actorTarget[$action]   = [Math]::Max(0.01, [Math]::Min(0.99, $probs[$action] + $nudge))

                # Renormalise to keep valid probability distribution
                $sum = ($actorTarget | Measure-Object -Sum).Sum
                for ($i = 0; $i -lt $actorTarget.Length; $i++) {
                    $actorTarget[$i] = $actorTarget[$i] / $sum
                }

                $actorLoss      = $this.Actor.TrainSample($state, $actorTarget)
                $totalActorLoss += $actorLoss
                $updateSamples++
            }
        }

        if ($updateSamples -gt 0) {
            $this.LastActorLoss  = $totalActorLoss  / $updateSamples
            $this.LastCriticLoss = $totalCriticLoss / $updateSamples
            $this.LastEntropy    = $totalEntropy     / $updateSamples
            $this.ActorLossHistory.Add($this.LastActorLoss)
            $this.CriticLossHistory.Add($this.LastCriticLoss)
        }

        $this.UpdateCount++
        $this.ClearRollout()   # Discard rollout -- on-policy learning
    }

    [void] EndEpisode([double]$totalReward) {
        $this.TotalEpisodes++
        $this.EpisodeRewards.Add($totalReward)
    }

    [hashtable] GetStats() {
        $avgReward     = 0.0
        $avgActorLoss  = 0.0
        $avgCriticLoss = 0.0

        if ($this.EpisodeRewards.Count    -gt 0) { $avgReward     = ($this.EpisodeRewards    | Select-Object -Last 100 | Measure-Object -Average).Average }
        if ($this.ActorLossHistory.Count  -gt 0) { $avgActorLoss  = ($this.ActorLossHistory  | Select-Object -Last 100 | Measure-Object -Average).Average }
        if ($this.CriticLossHistory.Count -gt 0) { $avgCriticLoss = ($this.CriticLossHistory | Select-Object -Last 100 | Measure-Object -Average).Average }

        return @{
            TotalEpisodes  = $this.TotalEpisodes
            TotalSteps     = $this.TotalSteps
            UpdateCount    = $this.UpdateCount
            LastActorLoss  = [Math]::Round($this.LastActorLoss,  6)
            LastCriticLoss = [Math]::Round($this.LastCriticLoss, 6)
            LastEntropy    = [Math]::Round($this.LastEntropy,     4)
            AvgReward100   = [Math]::Round($avgReward,            3)
            AvgActorLoss   = [Math]::Round($avgActorLoss,         6)
            AvgCriticLoss  = [Math]::Round($avgCriticLoss,        6)
        }
    }

    [void] PrintStats() {
        $s = $this.GetStats()
        Write-Host ""
        Write-Host " +--------------------------------------+" -ForegroundColor Cyan
        Write-Host " | PPO Agent Statistics                 |" -ForegroundColor Cyan
        Write-Host " +--------------------------------------+" -ForegroundColor Cyan
        Write-Host (" | Episodes     : {0,-22}|" -f $s.TotalEpisodes)   -ForegroundColor White
        Write-Host (" | Total Steps  : {0,-22}|" -f $s.TotalSteps)      -ForegroundColor White
        Write-Host (" | PPO Updates  : {0,-22}|" -f $s.UpdateCount)     -ForegroundColor White
        Write-Host (" | Avg Reward   : {0,-22}|" -f $s.AvgReward100)    -ForegroundColor Green
        Write-Host (" | Entropy      : {0,-22}|" -f $s.LastEntropy)     -ForegroundColor Yellow
        Write-Host (" | Actor Loss   : {0,-22}|" -f $s.LastActorLoss)   -ForegroundColor Magenta
        Write-Host (" | Critic Loss  : {0,-22}|" -f $s.LastCriticLoss)  -ForegroundColor Magenta
        Write-Host " +--------------------------------------+" -ForegroundColor Cyan
        Write-Host ""
    }
}


# ============================================================
# PPOENVIRONMENT -- CartPole simulation (same physics as DQN)
# ============================================================
# Kept separate from VBAFEnvironment so PPO.ps1 is self-contained.
# See VBAF.RL.Environment.ps1 for the shared environment abstraction.
class PPOEnvironment {
    [double] $Position
    [double] $Velocity
    [double] $Angle
    [double] $AngularVelocity
    [int]    $Steps
    [int]    $MaxSteps
    hidden [System.Random] $Rng

    PPOEnvironment() {
        $this.MaxSteps = 200
        $this.Rng      = [System.Random]::new()
        $this.Reset()
    }

    [double[]] Reset() {
        $this.Position        = ($this.Rng.NextDouble() - 0.5) * 0.1
        $this.Velocity        = ($this.Rng.NextDouble() - 0.5) * 0.1
        $this.Angle           = ($this.Rng.NextDouble() - 0.5) * 0.1
        $this.AngularVelocity = ($this.Rng.NextDouble() - 0.5) * 0.1
        $this.Steps           = 0
        return $this.GetState()
    }

    [double[]] GetState() {
        return @($this.Position, $this.Velocity, $this.Angle, $this.AngularVelocity)
    }

    [hashtable] Step([int]$action) {
        $this.Steps++
        $force     = if ($action -eq 1) { 1.0 } else { -1.0 }
        $gravity   = 9.8; $cartMass = 1.0; $poleMass = 0.1
        $totalMass = $cartMass + $poleMass; $halfLen = 0.25; $dt = 0.02

        $cosA = [Math]::Cos($this.Angle); $sinA = [Math]::Sin($this.Angle)
        $temp = ($force + $poleMass * $halfLen * $this.AngularVelocity * $this.AngularVelocity * $sinA) / $totalMass
        $aAcc = ($gravity * $sinA - $cosA * $temp) / ($halfLen * (4.0/3.0 - $poleMass * $cosA * $cosA / $totalMass))
        $acc  = $temp - $poleMass * $halfLen * $aAcc * $cosA / $totalMass

        $this.Position        += $dt * $this.Velocity
        $this.Velocity        += $dt * $acc
        $this.Angle           += $dt * $this.AngularVelocity
        $this.AngularVelocity += $dt * $aAcc

        $done   = ($this.Steps -ge $this.MaxSteps) -or ([Math]::Abs($this.Position) -gt 2.4) -or ([Math]::Abs($this.Angle) -gt 0.21)
        $reward = if (-not $done) { 1.0 } else { 0.0 }

        return @{ NextState = $this.GetState(); Reward = $reward; Done = $done }
    }
}


# ============================================================
# INVOKE-PPOTRAINING -- the PPO training loop
# ============================================================
#
# THE PPO TRAINING LOOP:
# ----------------------
# For each episode:
# 1. Reset environment
# 2. While not done:
# a. Call Act() -- Actor picks action stochastically
# b. Step environment -- get reward and next state
# c. Call StoreTransition() -- save (s, a, r, V, logP, done)
# d. If rollout buffer full: call Update(lastValue)
# 3. EndEpisode() -- record total reward
#
# UPDATE FREQUENCY:
# -----------------
# Unlike DQN, which updates every 4 steps from a replay buffer,
# PPO updates every RolloutSteps steps (64 by default, 32 in fast mode)
# from the rollout buffer.
# The rollout is then DISCARDED -- PPO is on-policy.
#
# FAST MODE:
# ----------
# Uses smaller networks (16->16), shorter episodes (30 steps)
# and shorter rollouts (32 steps).
# Reduces training time for quick tests.

function Invoke-PPOTraining {
    param(
        [int]    $Episodes   = 100,
        [int]    $PrintEvery = 10,
        [switch] $Quiet,
        [switch] $FastMode
    )

    $actorHidden  = @(64, 64)
    $criticHidden = @(64, 64)
    $maxSteps     = 200
    $rolloutSteps = 64

    if ($FastMode) {
        $actorHidden  = @(16, 16)
        $criticHidden = @(16, 16)
        $maxSteps     = 30
        $rolloutSteps = 32
        if ($Episodes   -eq 100) { $Episodes   = 50 }
        if ($PrintEvery -eq 10)  { $PrintEvery  = 5  }
        Write-Host ""
        Write-Host " FAST MODE -- smaller network, fewer steps" -ForegroundColor Yellow
        Write-Host " Actor/Critic : 16 -> 16" -ForegroundColor Yellow
        Write-Host " Episodes : $Episodes" -ForegroundColor Yellow
    }

    Write-Host ""
    Write-Host " VBAF PPO Training" -ForegroundColor Green
    Write-Host " Episodes: $Episodes" -ForegroundColor Cyan
    Write-Host ""

    $config                = [PPOConfig]::new()
    $config.StateSize      = 4
    $config.ActionSize     = 2
    $config.ActorHidden    = $actorHidden
    $config.CriticHidden   = $criticHidden
    $config.LearningRate   = 0.001
    $config.Gamma          = 0.99
    $config.LambdaGAE      = 0.95
    $config.ClipEpsilon    = 0.2
    $config.EntropyBonus   = 0.01
    $config.UpdateEpochs   = 4
    $config.RolloutSteps   = $rolloutSteps
    $config.MaxSteps       = $maxSteps

    # Build Actor: [StateSize -> hidden -> ActionSize]
    $actorLayers = [System.Collections.Generic.List[int]]::new()
    $actorLayers.Add($config.StateSize)
    foreach ($h in $config.ActorHidden) { $actorLayers.Add($h) }
    $actorLayers.Add($config.ActionSize)

    # Build Critic: [StateSize -> hidden -> 1]
    $criticLayers = [System.Collections.Generic.List[int]]::new()
    $criticLayers.Add($config.StateSize)
    foreach ($h in $config.CriticHidden) { $criticLayers.Add($h) }
    $criticLayers.Add(1)   # Single value output

    # Instantiate the networks here, outside PPOAgent, and inject them -- PS 5.1 dependency injection
    $actor  = [NeuralNetwork]::new($actorLayers.ToArray(),  $config.LearningRate)
    $critic = [NeuralNetwork]::new($criticLayers.ToArray(), $config.LearningRate)
    $agent  = [PPOAgent]::new($config, $actor, $critic)

    $env          = [PPOEnvironment]::new()
    $env.MaxSteps = $maxSteps

    $bestReward  = 0.0
    $stepCounter = 0

    for ($ep = 1; $ep -le $Episodes; $ep++) {
        $state       = $env.Reset()
        $totalReward = 0.0
        $done        = $false

        while (-not $done) {
            $result  = $agent.Act($state)
            $action  = $result.Action
            $logProb = $result.LogProb
            $value   = $result.Value

            $step    = $env.Step($action)
            $ns      = $step.NextState
            $reward  = $step.Reward
            $done    = $step.Done

            $agent.StoreTransition($state, $action, $reward, $value, $logProb, $done)
            $state        = $ns
            $totalReward += $reward
            $stepCounter++

            # Update when rollout buffer is full
            if ($stepCounter % $config.RolloutSteps -eq 0) {
                $lastVal = $agent.Critic.Predict($state)[0]
                $agent.Update($lastVal)
            }
        }

        $agent.EndEpisode($totalReward)
        if ($totalReward -gt $bestReward) { $bestReward = $totalReward }

        if (-not $Quiet -and ($ep % $PrintEvery -eq 0)) {
            $stats = $agent.GetStats()
            Write-Host (" Ep {0,4} Reward: {1,5:F0} Best: {2,5:F0} Updates: {3,4} Entropy: {4:F3} CriticLoss: {5:F5}" -f `
                $ep, $totalReward, $bestReward, $stats.UpdateCount, $stats.LastEntropy, $stats.LastCriticLoss) -ForegroundColor White
        }
    }

    # Final update on any remaining rollout steps
    if ($agent.States.Count -gt 0) { $agent.Update(0.0) }

    Write-Host ""
    Write-Host " Training Complete!" -ForegroundColor Green
    $agent.PrintStats()
    ,$agent
}

# ============================================================
# QUICK REFERENCE
# ============================================================
#
# BASIC USAGE:
# . .\VBAF.LoadAll.ps1
# $agent = (Invoke-PPOTraining -Episodes 100 -PrintEvery 10)[-1]
# $agent.PrintStats()
#
# FAST TEST:
# $agent = (Invoke-PPOTraining -Episodes 5 -PrintEvery 1 -FastMode)[-1]
#
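# GREEDY EVALUATION (a minimal sketch; assumes $agent from one of the calls above
# and uses only Predict() and PPOEnvironment as defined in this file):
# $env = [PPOEnvironment]::new()
# $state = $env.Reset(); $total = 0.0; $done = $false
# while (-not $done) {
#     $step  = $env.Step($agent.Predict($state))   # highest-probability action, no sampling
#     $state = $step.NextState; $total += $step.Reward; $done = $step.Done
# }
# "Greedy episode reward: $total"
#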
# COMPARE WITH DQN:
# $dqn = (Invoke-DQNTraining -Episodes 100 -FastMode)[-1]
# $ppo = (Invoke-PPOTraining -Episodes 100 -FastMode)[-1]
# $env = New-VBAFEnvironment -Name "CartPole"
# Invoke-VBAFBenchmark -Agent $dqn -Environment $env -Episodes 20 -Label "DQN"
# Invoke-VBAFBenchmark -Agent $ppo -Environment $env -Episodes 20 -Label "PPO"
#
# KEY DIFFERENCES FROM DQN TO WATCH:
# DQN: Epsilon decays from 1.0 to 0.01 (explore-exploit shift)
# PPO: Entropy decreases as policy becomes more confident
# DQN: One network (Q-values)
# PPO: Two networks (Actor + Critic)
# DQN: Off-policy (replay buffer with old experiences)
# PPO: On-policy (rollout buffer discarded after each update)
#
# SEE ALSO:
# VBAF.RL.DQN.ps1 -- Q-value based alternative
# VBAF.RL.A3C.ps1 -- asynchronous actor-critic (next algorithm)
# ============================================================

Write-Host " VBAF.RL.PPO.ps1 loaded" -ForegroundColor Green
Write-Host " Classes : PPOConfig, PPOAgent, PPOEnvironment" -ForegroundColor Cyan
Write-Host " Function : Invoke-PPOTraining"                  -ForegroundColor Cyan
Write-Host ""
Write-Host " Quick start:" -ForegroundColor Yellow
Write-Host ' $agent = (Invoke-PPOTraining -Episodes 50 -PrintEvery 5 -FastMode)[-1]' -ForegroundColor White
Write-Host ' $agent.PrintStats()' -ForegroundColor White
Write-Host ""