
Operational Notes

✅ Subscription & Identity

  • Correct subscription selected:
az account show --query id -o tsv
  • Account has permissions to create (see the check after this list):
    • Resource Groups
    • VNet / Subnets
    • Application Gateway
    • Storage Account
    • Key Vault
    • Container Apps
    • Private Endpoints
    • Redis Enterprise
    • Cosmos Mongo vCore
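
A quick way to review what the signed-in account can do at the subscription scope (a sketch; exact role names and required permissions depend on your tenant's RBAC model):

# Object ID of the signed-in user (older CLI versions expose this as objectId instead of id)
ASSIGNEE=$(az ad signed-in-user show --query id -o tsv)

# List the role assignments visible for that account across scopes
az role assignment list --assignee "$ASSIGNEE" --all -o table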

✅ Naming & Configuration Review

Open 00_config.sh and verify:

  • Subscription ID
  • Resource Group name
  • Region (LOC)
  • Storage account name (globally unique)
  • ACR name (globally unique)
  • Key Vault name
  • CIDR ranges do not conflict with corp/VPN networks
  • Domain names are correct
  • Mongo region override = eastasia

✅ Required Environment Variables

Set before running scripts:

export AZ_SUBSCRIPTION_ID="..."
export MONGO_VCORE_ADMIN_PW="StrongPasswordHere"
export CERT_PFX_PATH="/path/to/prod.pfx"
export CERT_PFX_PASSWORD="..."
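
A minimal pre-flight check (a sketch, not part of the scripts) to make sure none of these are missing before you start:

# Fail fast if any required variable is unset or empty
for v in AZ_SUBSCRIPTION_ID MONGO_VCORE_ADMIN_PW CERT_PFX_PATH CERT_PFX_PASSWORD; do
  [ -n "${!v:-}" ] || { echo "Missing required env var: $v" >&2; exit 1; }
done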

✅ Application Gateway Expectations

  • App Gateway uses --no-wait during creation.
  • Always wait for provisioning to complete before running config scripts.
  • Confirm with:
az network application-gateway show \
  -g eq-prod-resgroup \
  -n eq-prod-appgw \
  --query provisioningState -o tsv

Wait until:

Succeeded
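
If you prefer to poll from the shell instead of re-running the command by hand, a simple loop works (a sketch using the same query as above):

# Poll until the Application Gateway reports Succeeded (or bail out on Failed)
while true; do
  STATE=$(az network application-gateway show -g eq-prod-resgroup -n eq-prod-appgw \
    --query provisioningState -o tsv)
  echo "provisioningState: $STATE"
  [ "$STATE" = "Succeeded" ] && break
  [ "$STATE" = "Failed" ] && { echo "Provisioning failed" >&2; exit 1; }
  sleep 30
done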

✅ Execution Order Reminder

Run scripts in order:

01 → 07 → 08a/08b → 09 → 10 → 11 → 13 → 14 → 15

Stop immediately if any script fails.
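
One way to enforce the stop-on-failure rule is a small driver loop (a sketch; the globbed names stand in for the real file names in the repo):

set -euo pipefail   # abort the run on the first failing script

# Step list is illustrative; substitute the actual file names for the 01 → 15 order above
scripts=(
  01_*.sh
  07_storage_share.sh
  08a_*.sh 08b_*.sh
  09_mongo_vcore.sh
  10_*.sh
  11_identities_and_acr_pull.sh
  13_appgw_create.sh
  14_appgw_config_backend.sh
  15_appgw_url_path_map.sh
)

for script in "${scripts[@]}"; do
  echo ">>> running $script"
  bash "$script"
done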


EQ-PROD Deployment Notes & Troubleshooting

This file contains operational notes discovered during deployment and validation.


1. Azure CLI DNS Resolution Error

Symptom

HTTPSConnectionPool(host='login.microsoftonline.com', port=443): Max retries exceeded with url: /organizations/v2.0/.well-known/openid-configuration NameResolutionError: Failed to resolve 'login.microsoftonline.com'

Meaning

The machine running Azure CLI cannot resolve public DNS names. This is NOT caused by the deployment scripts.

Azure authentication requires outbound DNS + HTTPS connectivity.


2. Quick Validation Commands

Run these on the machine executing the scripts:

nslookup login.microsoftonline.com
nslookup management.azure.com
curl -I https://login.microsoftonline.com -m 10

If these fail → DNS or outbound networking is broken.


3. Fix Scenarios

A. Linux VM

Check resolver:

cat /etc/resolv.conf

Temporary fix:

sudo bash -c 'cat > /etc/resolv.conf <<EOF
nameserver 8.8.8.8
nameserver 1.1.1.1
EOF'
sudo systemctl restart systemd-resolved

Re-test:

nslookup login.microsoftonline.com
az login

B. WSL2 (Most Common)

  1. Replace DNS:
sudo rm -f /etc/resolv.conf
sudo bash -c 'cat > /etc/resolv.conf <<EOF
nameserver 1.1.1.1
nameserver 8.8.8.8
EOF'
  2. Prevent auto overwrite:
sudo bash -c 'cat > /etc/wsl.conf <<EOF
[network]
generateResolvConf = false
EOF'
  3. Restart WSL from Windows PowerShell:
wsl --shutdown

Then retry:

nslookup login.microsoftonline.com
az login

C. Corporate Proxy

Set proxy variables:

export HTTPS_PROXY="http://user:pass@proxy.company.com:8080"
export HTTP_PROXY="http://user:pass@proxy.company.com:8080"
export NO_PROXY="localhost,127.0.0.1,.azure.com,.microsoftonline.com"

4. Network Requirements for Azure CLI

Outbound access required:

  • DNS: UDP/TCP 53
  • HTTPS: TCP 443 to:
    • login.microsoftonline.com
    • management.azure.com
    • graph.microsoft.com
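
A combined check over these endpoints (same commands as Section 2, looped over the hosts above):

# Verify DNS resolution and outbound HTTPS for each endpoint the CLI needs
for host in login.microsoftonline.com management.azure.com graph.microsoft.com; do
  echo "--- $host"
  nslookup "$host" || echo "DNS lookup failed for $host"
  curl -sI "https://$host" -m 10 >/dev/null && echo "HTTPS OK" || echo "HTTPS failed"
done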


5. App Gateway → Internal ACA Validation

Backend Health Check

az network application-gateway show-backend-health \
  -g eq-prod-resgroup \
  -n eq-prod-appgw -o table

Expected: all backends Healthy.
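
To list only the per-server health values, a filtered query can help (a sketch; the JMESPath assumes the default output shape of show-backend-health):

az network application-gateway show-backend-health \
  -g eq-prod-resgroup -n eq-prod-appgw \
  --query "backendAddressPools[].backendHttpSettingsCollection[].servers[].{address:address,health:health}" \
  -o table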


DNS Validation from Inside VNet

From a VM inside the VNet:

nslookup <containerapp-fqdn>
curl -vk https://<containerapp-fqdn>/healthz

Expected:

  • Private IP resolution
  • Successful HTTPS response


6. Common Causes of AppGW Unhealthy Backends

  • Custom DNS servers not forwarding Azure private zones
  • Wrong probe path or protocol
  • NSG blocking AppGW subnet → ACA infra subnet
  • Incorrect listener or backend settings

7. Ingress Model

  • Container Apps: internal ingress only
  • Public exposure: Application Gateway only
  • No direct public access to ACA endpoints
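
To confirm an app is not exposed publicly (a sketch; <rg> and <app> are placeholders):

# "external": false means the app is reachable only inside the VNet
az containerapp ingress show -g <rg> -n <app> --query external -o tsv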



8. Application Gateway --no-wait Behavior

The script 13_appgw_create.sh uses the flag:

--no-wait

This means the command returns immediately while Application Gateway is still provisioning in the background. If you continue to run:

  • 14_appgw_config_backend.sh
  • 15_appgw_url_path_map.sh

before provisioning finishes, the commands may fail or partially configure resources.

✅ Required Action

Always verify that App Gateway provisioning is complete before continuing:

az network application-gateway show \
  -g eq-prod-resgroup \
  -n eq-prod-appgw \
  --query provisioningState -o tsv

Wait until it returns:

Succeeded

Only then continue with the next scripts.

⚠️ Common Symptoms if Skipped

  • Resource not found errors
  • Listener / frontend-port creation failures
  • Backend pool conflicts
  • Random intermittent CLI failures


9. Most Common Issues (So You’re Not Surprised)

❌ Azure CLI cannot login / random REST failures

Symptoms

  • DNS resolution errors
  • Timeouts to login.microsoftonline.com
  • Intermittent REST failures

Root cause

  • Local DNS misconfiguration
  • Corporate proxy not configured
  • Firewall blocking outbound 443 / 53

Fix

  • Follow Sections 1–4 in this document to fix DNS / proxy / outbound access.

❌ Application Gateway backends show Unhealthy

Symptoms

  • Backend health shows Unhealthy, Unknown, or Timeout
  • No traffic reaches Container Apps

Root cause

  • DNS resolution failure from AppGW subnet
  • Probe path mismatch (e.g., /healthz not implemented)
  • Wrong protocol (HTTP vs HTTPS)
  • NSG blocking traffic between AppGW subnet and ACA infra subnet

Fix

  • Validate DNS from a VM inside the VNet.
  • Confirm probe path exists on the app.
  • Verify NSG rules allow outbound traffic from the AppGW subnet (see the NSG check below).
  • Check backend protocol and port.
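
For the NSG check above, listing the rules on the NSG attached to the AppGW or ACA infra subnet is usually enough (a sketch; names are placeholders):

az network nsg rule list -g <rg> --nsg-name <nsg-name> -o table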

❌ Private Endpoint stuck in Pending Approval

Symptoms

  • Redis / Mongo private endpoint never becomes usable
  • DNS resolves but connection fails

Root cause

  • Private endpoint connection requires manual approval
  • RBAC restrictions on resource owner

Fix

az network private-endpoint-connection list --id <resource-id>
az network private-endpoint-connection approve --id <connection-id>

❌ Container App image pull failures

Symptoms

  • Revisions fail to start
  • Logs show ImagePullBackOff or authentication errors

Root cause

  • Managed Identity not granted AcrPull
  • Registry identity not configured on the app

Fix

az containerapp identity show -g <rg> -n <app>
az role assignment list --assignee <principalId>
az containerapp registry show -g <rg> -n <app>

Re-run:

11_identities_and_acr_pull.sh
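
If the AcrPull assignment is missing, it can also be granted directly before re-running the script (a sketch; <rg>, <acr-name>, and <principalId> are placeholders):

# Resolve the registry's resource ID, then grant AcrPull to the app's managed identity
ACR_ID=$(az acr show -g <rg> -n <acr-name> --query id -o tsv)
az role assignment create --assignee <principalId> --role AcrPull --scope "$ACR_ID"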

❌ App Gateway configuration scripts fail after creation

Symptoms

  • Listener creation fails
  • Frontend port already exists
  • Random resource not found errors

Root cause

  • App Gateway still provisioning (--no-wait was used)

Fix

Wait until provisioning completes:

az network application-gateway show \
  -g eq-prod-resgroup \
  -n eq-prod-appgw \
  --query provisioningState -o tsv

Only proceed when status is Succeeded.


❌ WebSocket disconnects or drops unexpectedly

Symptoms

  • WS connections drop after a few minutes
  • Random reconnects

Root cause

  • AppGW backend timeout too low
  • Idle timeout mismatch

Fix

  • Ensure WebSocket backend http-settings timeout is high (e.g., 1800s or more); see the example below.
  • Verify application keepalive behavior.
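
To raise the backend timeout mentioned above (a sketch; the http-settings name is a placeholder):

# Request timeout is in seconds; 1800 keeps long-lived WebSocket connections open
az network application-gateway http-settings update \
  -g eq-prod-resgroup --gateway-name eq-prod-appgw \
  -n <ws-http-settings-name> --timeout 1800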


10. Storage Share Creation Error: --auth-mode login

Symptom

unrecognized arguments: --auth-mode login

Root Cause

Older Azure CLI versions do not support --auth-mode login for:

az storage share create

Fix Applied

Script 07_storage_share.sh now:

  • Retrieves the storage account key automatically:
az storage account keys list
  • Uses --account-name and --account-key for share creation and ACA environment attachment.

No CLI upgrade is required.
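
The equivalent commands, run manually, look like this (a sketch; <rg>, <storage-account>, and <share-name> are placeholders):

# Fetch the primary key, then create the share with key-based auth
STORAGE_KEY=$(az storage account keys list -g <rg> -n <storage-account> --query "[0].value" -o tsv)
az storage share create --name <share-name> \
  --account-name <storage-account> --account-key "$STORAGE_KEY"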



Mongo vCore schema errors: missing highAvailability / storage type / password complexity

If you see errors like:

  • required property of 'highAvailability'
  • storage contained undefined properties: 'type'
  • Password must be between 8 & 256 characters, and contain 3 of the following...

It means:

  • Your admin password doesn't meet Azure complexity requirements, and/or
  • Your request body doesn't match the schema for the API version you're calling.

Fixes applied in 09_mongo_vcore.sh:

  • Validates password complexity before calling Azure.
  • Always includes highAvailability.targetMode (default: Disabled).
  • Omits storage.type for older API versions (2024-07-01 / 2024-10-01-preview), and includes it only for 2025-* API versions.

You can tune HA with:

  • MONGO_HA_TARGET_MODE=Disabled|SameZone|ZoneRedundantPreferred
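
For example, to request same-zone HA on a cluster tier that supports it (a sketch; M30 does not, see the next note):

export MONGO_HA_TARGET_MODE="SameZone"
bash 09_mongo_vcore.sh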


Mongo vCore: High Availability not available for 'M30' cluster tier

If you see:

High Availability not available for 'M30' cluster tier.

Fix:

  • Do NOT send the highAvailability object at all for M30.
  • In 00_config.sh, leave MONGO_HA_TARGET_MODE empty (default), so the script omits HA completely.

Example:

export MONGO_HA_TARGET_MODE=""
bash 09_mongo_vcore.sh


Mongo vCore: highAvailability required by schema (but you can still run "no HA")

Some API versions validate highAvailability as a required property. In those cases:

  • Set highAvailability.targetMode to Disabled to indicate no HA.

For older preview schemas, nodeGroupSpecs[*].enableHa is required:

  • Set it to false (no HA).

The script 09_mongo_vcore.sh now does this automatically for M30.



Config documentation

00_config.sh has become long. A dedicated reference is provided:

  • CONFIG.md — documentation for configuration variables

Keep secrets out of 00_config.sh (use pipeline/env vars).


Mongo vCore: HA fields can be schema-required

Validation in your tenant showed that:

  • New-schema API versions (2024-07-01, 2024-10-01-preview, 2025-*) require highAvailability.
  • Older preview schemas require nodeGroupSpecs[*].enableHa and diskSizeGB.

09_mongo_vcore.sh now:

  • sends highAvailability.targetMode=Disabled (no HA) for new schema
  • sends enableHa=false + diskSizeGB=<size> for old schema

This keeps the deployment no-HA, but passes schema validation.



File 12 error: (ContainerAppInvalidEnvVarName) template.containers[0].volumeMounts

If you see:

Env variable name 'template.containers[0].volumeMounts' contains invalid character

Cause:

  • az containerapp update does not support generic --set for template fields; it interprets the key as an env var name.

Fix (applied in v22):

  • Use az resource update (see the sketch after this list) to set:
    • properties.template.volumes
    • properties.template.containers[0].volumeMounts
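
A sketch of that az resource update call (the resource group, app name, volume name, mount path, and storage link name are placeholders; the JSON shape follows the Container Apps template schema):

# Resolve the Container App's resource ID, then patch volumes and volumeMounts directly
APP_ID=$(az containerapp show -g <rg> -n <app> --query id -o tsv)

az resource update --ids "$APP_ID" \
  --set properties.template.volumes='[{"name":"appdata","storageType":"AzureFile","storageName":"<storage-link-name>"}]' \
  --set properties.template.containers[0].volumeMounts='[{"volumeName":"appdata","mountPath":"/data"}]'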


Domain configuration

Domain variables are centralized in 00_config.sh:

  • APP_PUBLIC_DOMAIN
  • APPGW_PUBLIC_FQDN

These are safe to keep as placeholders during infra bootstrap.