Private/Invoke-AzResourceGraphQuery.ps1
|
function Invoke-AzResourceGraphQuery { <# .SYNOPSIS Runs an Azure Resource Graph query via 'az graph query' and transparently follows skip_token pagination until all rows are returned. .DESCRIPTION The Azure CLI returns at most --first rows per call (max 1000). When a fleet has more than 1000 clusters the caller was previously receiving only a truncated first page. This helper loops on the response's skip_token field, aggregating .data across pages and returning the merged row array. Safety cap: MaxPages (default 100 -> 100,000 rows). Prevents a bug in the caller's query from producing an infinite pagination loop. When the cap is hit, a Write-Warning is emitted, the module-scope flag $script:LastResourceGraphQueryTruncated is set to $true, and the partial result is returned. Callers that need to behave differently on truncation can read the flag after the call. v0.7.68: the Query string is normalised (CR/LF and runs of whitespace collapsed to single spaces) BEFORE being passed to 'az graph query -q'. On Windows, az is implemented as az.cmd (a batch file), and the CMD argument parser truncates command-line arguments at the first CR/LF - so a multi-line PowerShell here-string KQL would silently be reduced to just its first line on the runner. The pre-v0.7.68 behaviour caused Test-AzLocalApplyUpdatesScheduleCoverage to silently return all resources (default schema, no UpdateRing/UpdateWindow columns) instead of the projected cluster rows it asked for; the audit then reported zero tagged clusters even when clusters were tagged correctly. The normalisation here protects every caller, current and future. KQL is whitespace-agnostic, so collapsing newlines/tabs is semantically a no-op. .PARAMETER Query KQL query string. Normalised to single-line before being passed to 'az graph query -q'. See .DESCRIPTION for the Windows az.cmd reason. .PARAMETER SubscriptionId Optional. If supplied, scopes the query to that subscription via --subscriptions. Omit to query across all accessible subscriptions. .PARAMETER First Page size. Defaults to 1000 (the ARG maximum). .PARAMETER MaxPages Safety cap. Defaults to 100 (= 100,000 rows). Bumped from 50 in v0.7.68 so that fleets up to ~100K clusters paginate without operator intervention. .PARAMETER MaxRetries v0.7.68: per-page retry budget for transient ARG throttling (HTTP 429 / RateLimitingException). Each page can be retried up to this many times with exponential backoff + jitter before the call gives up and throws. Defaults to 5 (initial attempt + 5 retries = 6 total tries per page). .PARAMETER RetryBaseSeconds v0.7.68: base delay for the exponential backoff on throttle retries. Defaults to 1 second; attempt N sleeps `RetryBaseSeconds * 2^(N-1)` seconds (1, 2, 4, 8, 16, ...) with +/-20% random jitter. The minimum effective sleep is 0.5s. .PARAMETER DisableCrossCallCooldown v0.7.84: opt out of the cross-call ARG cooldown wait at the start of this call. The cooldown is a small voluntary pause that this helper applies on entry if a PREVIOUS call to this helper observed ARG throttling. It is the cross-call companion to the per-page retry loop: the retry loop protects against bursts WITHIN one call; the cooldown protects against bursts ACROSS sequential calls (typical of fan-out fleet queries that issue many ARG requests back-to-back). Use this switch only in unit tests or when the caller has its own rate-limiting upstream. .NOTES v0.7.68 throttle handling exposes two module-scope diagnostic flags reset at the start of every call and readable by callers/tests: $script:LastResourceGraphThrottled - $true if any retry happened $script:LastResourceGraphRetryCount - total retries across all pages Recognised throttle markers in the CLI error text (case-insensitive): 'rate limit', 'ratelimit', 'throttl', '429', 'too many requests'. Once a throttle event is observed during a call, an inter-page pause of 1 second is inserted between subsequent pages to avoid immediately re-triggering the limiter. v0.7.84 cross-call coordination adds two more module-scope items: $script:ArgCrossCallCooldownUntil - [DateTime]; until this instant the NEXT call voluntarily sleeps on entry. Set on throttle. $script:ArgConsecutiveThrottledCalls - int counter for adaptive cooldown duration (capped 5). $script:LastResourceGraphCrossCallCooldownSeconds - per-call diagnostic; how long THIS call slept on entry due to the cooldown. The cross-call counter decays by 1 on every clean (un-throttled) call so the cooldown does not accumulate forever. .OUTPUTS [object[]] of rows merged across all pages. Empty array if no rows. Throws if the CLI returns a non-zero exit code or the response cannot be parsed as JSON. #> [CmdletBinding()] [OutputType([object[]])] param( [Parameter(Mandatory = $true)] [ValidateNotNullOrEmpty()] [string]$Query, [Parameter(Mandatory = $false)] [string]$SubscriptionId, [Parameter(Mandatory = $false)] [ValidateRange(1, 1000)] [int]$First = 1000, [Parameter(Mandatory = $false)] [ValidateRange(1, 500)] [int]$MaxPages = 100, [Parameter(Mandatory = $false)] [ValidateRange(0, 20)] [int]$MaxRetries = 5, [Parameter(Mandatory = $false)] [ValidateRange(0.1, 60)] [double]$RetryBaseSeconds = 1, [Parameter(Mandatory = $false)] [switch]$DisableCrossCallCooldown ) # v0.7.68: collapse CR/LF and runs of whitespace into single spaces so the # query survives az.cmd's CMD argument parser on Windows. See .DESCRIPTION. # KQL is whitespace-agnostic in its grammar, so this is semantically inert # for every well-formed query. Leading/trailing whitespace is also trimmed # to keep the eventual command-line clean. # # v0.7.70 caller-contract note: callers MUST NOT embed KQL line-comments # (`// ...`) inside the here-string that they pass to this helper. KQL # line-comments terminate at a newline; once this function collapses the # newlines to spaces (above), a `//` runs all the way to end-of-input and # silently eats the rest of the query. The ARG parser then fails with # "expected | got <EOF>" at the truncation point. We deliberately do NOT # strip `//` here because a naive strip would also break URL literals # such as `'https://portal.azure.com/...'`. If you need to annotate your # KQL, do it in PowerShell `#` comments OUTSIDE the here-string. $Query = ($Query -replace '\s+', ' ').Trim() # Reset the truncation flag at the start of every call so a caller checking # $script:LastResourceGraphQueryTruncated sees only THIS call's outcome. $script:LastResourceGraphQueryTruncated = $false # v0.7.68: throttle diagnostics, reset per call. Tests and pipeline # callers can read these flags to detect/report transient ARG throttling. $script:LastResourceGraphThrottled = $false $script:LastResourceGraphRetryCount = 0 # v0.7.84: cross-call ARG cooldown coordination. The per-page retry loop # handles bursts WITHIN one call; this block adds coordination ACROSS # sequential calls. State is module-scope and initialised lazily on # first use. If a previous call set a cooldown window that is still # active, sleep out the remainder before issuing the first page request # of THIS call. The $DisableCrossCallCooldown switch is provided for # unit tests and for callers that have their own rate-limiting upstream. if (-not (Test-Path Variable:script:ArgCrossCallCooldownUntil)) { $script:ArgCrossCallCooldownUntil = [DateTime]::MinValue $script:ArgConsecutiveThrottledCalls = 0 } $script:LastResourceGraphCrossCallCooldownSeconds = 0 if (-not $DisableCrossCallCooldown.IsPresent) { $now = Get-Date if ($script:ArgCrossCallCooldownUntil -gt $now) { $waitSecs = ($script:ArgCrossCallCooldownUntil - $now).TotalSeconds if ($waitSecs -gt 0) { Write-Verbose ("Invoke-AzResourceGraphQuery: applying cross-call ARG cooldown ({0:N2}s) due to recent throttling." -f $waitSecs) Start-Sleep -Milliseconds ([int]($waitSecs * 1000)) $script:LastResourceGraphCrossCallCooldownSeconds = $waitSecs } } } # Inter-page pause (milliseconds). Starts at 0; ratchets to 1000ms after # the first throttle event in this call so subsequent pages don't # immediately re-trigger the limiter. $interPagePauseMs = 0 $allRows = [System.Collections.Generic.List[object]]::new() $skipToken = $null $pages = 0 # Force Azure CLI (Python) to write UTF-8 to stdout/stderr regardless of the # host console code page. Without this, any non-cp1252 character in an ARG # response (resource IDs from non-en-US tenants, localised update names, # cluster property strings, etc.) causes the CLI to emit a stderr warning # like "WARNING: Unable to encode the output with cp1252 encoding..." which, # when captured via 2>&1, gets prepended to the JSON and breaks # ConvertFrom-Json. See the matching hardening in Invoke-AzRestJson.ps1 # (the v0.7.2 cp1252 fix). NOTE: az.cmd launches python with -I (isolated), # which causes python to ignore PYTHONIOENCODING / PYTHONUTF8; the env-var # assignment is therefore best-effort. The hard fix is the stderr/stdout # split below: stderr lines surface as ErrorRecord objects under 2>&1 and # we only feed the stdout strings to ConvertFrom-Json. $prevPyEncoding = $env:PYTHONIOENCODING try { $env:PYTHONIOENCODING = 'utf-8' while ($true) { $pages++ if ($pages -gt $MaxPages) { $script:LastResourceGraphQueryTruncated = $true Write-Warning "Invoke-AzResourceGraphQuery: reached MaxPages=$MaxPages safety cap; returning partial result ($($allRows.Count) rows). Check the query for unbounded output or raise -MaxPages. Callers can detect this via `$script:LastResourceGraphQueryTruncated." break } $azArgs = @('graph', 'query', '-q', $Query, '--first', $First, '--only-show-errors') if ($SubscriptionId) { $azArgs += @('--subscriptions', $SubscriptionId) } if ($skipToken) { $azArgs += @('--skip-token', $skipToken) } # v0.7.68: per-page retry loop for transient ARG throttling. ARG # returns HTTP 429 / RateLimitingException when the caller exceeds # the per-subscription quota; the CLI surfaces those in the error # stream. Non-throttle failures (auth, bad KQL, permissions) fall # straight through to the throw path - we do NOT retry those. $retryAttempt = 0 $raw = $null $exit = $null $stderrLines = @() $stdoutLines = @() while ($true) { $raw = & az @azArgs 2>&1 $exit = $LASTEXITCODE # Split merged stdout+stderr by stream type. Stderr lines # (Python warnings, deprecation notices, throttle errors) # surface as ErrorRecord objects when using 2>&1; stdout # lines surface as strings. We only pass the string stream to # ConvertFrom-Json so a stray stderr warning can never corrupt # JSON. $stderrLines = @($raw | Where-Object { $_ -is [System.Management.Automation.ErrorRecord] }) $stdoutLines = @($raw | Where-Object { $_ -isnot [System.Management.Automation.ErrorRecord] }) if ($exit -eq 0) { break } $errText = ((($stderrLines + $stdoutLines) | Out-String).Trim()) $isThrottle = $errText -match '(?i)(rate.?limit|throttl|\b429\b|too many requests)' if ($isThrottle -and $retryAttempt -lt $MaxRetries) { $retryAttempt++ $script:LastResourceGraphThrottled = $true $script:LastResourceGraphRetryCount++ # v0.7.84: update cross-call cooldown so the NEXT call to # this helper (whatever the caller) waits this out before # its first page. Adaptive: more consecutive throttled # calls -> longer cooldown, capped at 10s and counter # capped at 5 so cooldown never exceeds the cap. $script:ArgConsecutiveThrottledCalls = [Math]::Min(5, $script:ArgConsecutiveThrottledCalls + 1) $cooldownSecs = [Math]::Min(10.0, $RetryBaseSeconds * [Math]::Pow(2, $script:ArgConsecutiveThrottledCalls - 1)) $script:ArgCrossCallCooldownUntil = (Get-Date).AddSeconds($cooldownSecs) $baseDelay = $RetryBaseSeconds * [Math]::Pow(2, $retryAttempt - 1) # +/-20% jitter, floor 0.5s $jitterFactor = 1 + (Get-Random -Minimum -0.2 -Maximum 0.2) $delay = [Math]::Max(0.5, $baseDelay * $jitterFactor) Write-Warning ("Invoke-AzResourceGraphQuery: ARG throttled on page {0} (attempt {1}/{2}); sleeping {3:N2}s before retry." -f $pages, $retryAttempt, $MaxRetries, $delay) Start-Sleep -Seconds $delay # Ratchet the inter-page pause so the NEXT page also waits if ($interPagePauseMs -lt 1000) { $interPagePauseMs = 1000 } continue } # Either not a throttle, or out of retries: fail hard. throw "Azure Resource Graph query failed (exit $exit): $(ConvertTo-ScrubbedCliOutput -Text $errText)" } $rawText = ($stdoutLines | Out-String).Trim() if ([string]::IsNullOrWhiteSpace($rawText)) { break } try { $parsed = $rawText | ConvertFrom-Json -ErrorAction Stop } catch { throw "Azure Resource Graph query failed to parse JSON: $($_.Exception.Message); raw: $(ConvertTo-ScrubbedCliOutput -Text $rawText.Substring(0, [Math]::Min(500, $rawText.Length)))" } # 'az graph query' returns either a top-level array (older CLI) or an # object with .data / .skip_token (newer CLI). Normalise. $rows = $null $nextToken = $null if ($parsed -is [System.Array]) { $rows = $parsed } elseif ($parsed.PSObject.Properties.Name -contains 'data') { $rows = $parsed.data if ($parsed.PSObject.Properties.Name -contains 'skip_token') { $nextToken = $parsed.skip_token } elseif ($parsed.PSObject.Properties.Name -contains 'skipToken') { $nextToken = $parsed.skipToken } } else { # Unknown shape - treat as single-row result $rows = @($parsed) } if ($rows) { foreach ($row in $rows) { [void]$allRows.Add($row) } } if (-not $nextToken) { break } $skipToken = $nextToken Write-Verbose "Invoke-AzResourceGraphQuery: fetched page $pages ($($allRows.Count) rows so far); following skip_token for next page." # v0.7.68: pause between pages if we have observed throttling in # this call, to avoid immediately re-triggering the limiter on # the next page request. if ($interPagePauseMs -gt 0) { Start-Sleep -Milliseconds $interPagePauseMs } } } finally { $env:PYTHONIOENCODING = $prevPyEncoding } # v0.7.84: decay the cross-call consecutive-throttle counter on a clean # call (no throttle observed during THIS call) so the cooldown window # does not accumulate forever in long-running scripts. One clean call # decreases the counter by 1; persistent throttling keeps it pinned # near the cap of 5 (-> 10s cooldown). if (-not $script:LastResourceGraphThrottled -and $script:ArgConsecutiveThrottledCalls -gt 0) { $script:ArgConsecutiveThrottledCalls = [Math]::Max(0, $script:ArgConsecutiveThrottledCalls - 1) } # IMPORTANT: the leading comma is required to preserve array shape so that # callers using `$x = Invoke-AzResourceGraphQuery ...` receive an [object[]] # for 0, 1, or N rows (not $null, scalar, or unwrapped enumerable). # WARNING: callers MUST NOT wrap this call with @( ... ). The `,`-return # plus `@()` combination produces a double-wrapped Object[1] containing the # inner array, which silently collapses N rows to 1 row of property-arrays. # Use `$x = Invoke-AzResourceGraphQuery ...` directly; the result is always # an array. return , $allRows.ToArray() } |