4) Technical and professional questions (the real filter)
This is where candidates who merely “used the cloud” are separated from those who operated it. US interviewers often probe depth by zooming into one decision: IAM boundaries, network routing, Terraform state, Kubernetes upgrades, or how you’d design for failure.
Q: Walk me through how you design a multi-account (or multi-subscription) cloud landing zone.
Why they ask it: They’re testing whether you can build scalable governance, not just deploy resources.
Answer framework: Build in layers: identity, networking, guardrails, logging, shared services.
Example answer: “I start with identity and access: centralized SSO, role-based access, and break-glass accounts. Then I design the account/subscription structure around environments and blast radius—prod separated from non-prod, plus shared services. Networking comes next: hub-and-spoke with clear ingress/egress controls and DNS strategy. I enforce guardrails with policy-as-code, central logging, and standardized tagging. The goal is a landing zone where teams can self-serve safely.”
Common mistake: Jumping straight to VPC/VNet diagrams without governance, logging, and access strategy.
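The “guardrails with policy-as-code” idea can be sketched as a simple rule check. This is a toy illustration in the spirit of OPA/Sentinel; the resource shape, tag names, and rules are invented for the example, not a real policy API.

```python
# Toy policy-as-code guardrail: required tags plus a public-exposure rule.
# Resource shape and rule names are illustrative assumptions.

REQUIRED_TAGS = {"owner", "environment", "cost-center"}

def violations(resource: dict) -> list:
    tags = resource.get("tags", {})
    found = []
    missing = sorted(REQUIRED_TAGS - set(tags))
    if missing:
        found.append(f"missing tags: {missing}")
    # Public exposure is only tolerated in sandbox environments here.
    if resource.get("public", False) and tags.get("environment") != "sandbox":
        found.append("public exposure outside sandbox")
    return found

bucket = {"type": "storage", "tags": {"owner": "team-a"}, "public": True}
assert violations(bucket) == [
    "missing tags: ['cost-center', 'environment']",
    "public exposure outside sandbox",
]
```

In a real landing zone these checks run in CI against Terraform plans or at the cloud control plane (Azure Policy, GCP org policies, AWS SCPs), so teams get fast feedback instead of post-hoc audits.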
Q: How do you manage Terraform state and prevent unsafe applies?
Why they ask it: They’re testing operational maturity around IaC.
Answer framework: “State–Workflow–Controls.” Explain backend choice, CI workflow, and policy checks.
Example answer: “I use remote state with locking—like S3 + DynamoDB or Terraform Cloud—so concurrent applies don’t corrupt state. Changes go through PRs with plan output visible, and applies are gated in CI with approvals for prod. I separate state by environment and sometimes by domain to limit blast radius. For controls, I add policy checks (OPA/Sentinel) for things like public exposure, encryption, and tagging. The goal is boring, repeatable infrastructure changes.”
Common mistake: Treating Terraform like a local tool and ignoring locking, drift, and review gates.
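The locking behavior the answer describes boils down to a conditional write: an apply may proceed only if no lock record exists. Here is a minimal in-memory sketch of that idea, standing in for the DynamoDB conditional put the S3 backend uses; it is not the real backend implementation.

```python
# Minimal sketch of optimistic state locking, in the spirit of
# Terraform's S3 + DynamoDB backend. In-memory stand-in, not real code.

class LockTable:
    """Stands in for a lock table supporting conditional writes."""
    def __init__(self):
        self._locks = {}

    def acquire(self, state_path: str, holder: str) -> bool:
        # Conditional write: succeeds only if no lock item exists yet.
        if state_path in self._locks:
            return False
        self._locks[state_path] = holder
        return True

    def release(self, state_path: str, holder: str) -> bool:
        # Only the current holder may release the lock.
        if self._locks.get(state_path) == holder:
            del self._locks[state_path]
            return True
        return False

table = LockTable()
assert table.acquire("prod/network.tfstate", "ci-run-1")
# A concurrent apply is rejected instead of corrupting state:
assert not table.acquire("prod/network.tfstate", "ci-run-2")
table.release("prod/network.tfstate", "ci-run-1")
assert table.acquire("prod/network.tfstate", "ci-run-2")
```

The point interviewers want to hear is the failure mode this prevents: two pipelines writing the same state file concurrently.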
Q: As an AWS Engineer, how would you design private connectivity from on-prem to AWS for critical workloads?
Why they ask it: They’re testing network fundamentals and tradeoffs.
Answer framework: Compare-and-choose: options, constraints, decision, validation.
Example answer: “I’d start with requirements: bandwidth, latency, redundancy, and compliance. For critical workloads, I’d typically prefer Direct Connect with redundant links and a VPN as backup, terminating into a transit gateway. I’d design routing with clear segmentation, and I’d validate failover by testing route changes and monitoring BGP sessions. If cost or lead time is an issue, I’d begin with site-to-site VPN but plan a migration path to Direct Connect.”
Common mistake: Naming services (TGW, DX) without explaining redundancy and failure testing.
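The redundancy story above is, at its core, route preference plus health: prefer Direct Connect while it is healthy, fall back to VPN when it is not. A simulated sketch (path names, preference values, and health flags are invented; real failover is driven by BGP, not Python):

```python
# Illustrative failover selection across redundant paths.
# Health flags simulate BGP session state; numbers are assumptions.

def select_path(paths):
    healthy = [p for p in paths if p["healthy"]]
    if not healthy:
        return None
    # Lower preference value wins, mirroring route priority.
    return min(healthy, key=lambda p: p["preference"])["name"]

paths = [
    {"name": "dx-primary",   "preference": 10,  "healthy": True},
    {"name": "dx-secondary", "preference": 20,  "healthy": True},
    {"name": "vpn-backup",   "preference": 100, "healthy": True},
]
assert select_path(paths) == "dx-primary"

# Simulate both Direct Connect links failing:
paths[0]["healthy"] = False
paths[1]["healthy"] = False
assert select_path(paths) == "vpn-backup"
```

Testing failover means deliberately forcing that second case in a game day, not trusting the diagram.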
Q: As an Azure Engineer, how do you implement least-privilege access at scale?
Why they ask it: They’re testing whether you can keep RBAC sane as org complexity grows.
Answer framework: “Scope–Role–Lifecycle.” Where permissions live, how they’re granted, how they’re reviewed.
Example answer: “I use management groups and subscriptions to scope access, then assign RBAC roles at the highest safe level to avoid permission sprawl. I rely on Entra ID groups rather than individual assignments, and I use PIM for just-in-time elevation for sensitive roles. For service principals/managed identities, I scope permissions to resource groups or specific resources. Finally, I set up access reviews and logging so we can prove who had access and when.”
Common mistake: Hand-waving with “we just use Contributor everywhere because it’s faster.”
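The “assign at the highest safe scope, via groups” pattern can be sketched as scope inheritance: a role granted at a management group applies to everything underneath it. The scope strings, group names, and the startswith check below are simplifications for illustration, not the Azure RBAC API.

```python
# Sketch of group-based role assignment with scope inheritance,
# loosely modeled on Azure RBAC over management groups.
# Scope format and startswith matching are deliberate simplifications.

ASSIGNMENTS = {
    # (group, scope) -> role
    ("platform-admins", "/mg-root"): "Owner",
    ("app-team-a", "/mg-root/sub-prod/rg-app-a"): "Contributor",
}
MEMBERSHIP = {"alice": {"app-team-a"}, "bob": {"platform-admins"}}

def roles_for(user, scope):
    roles = set()
    for (group, assigned_scope), role in ASSIGNMENTS.items():
        # A role applies at the assigned scope and everything below it.
        if group in MEMBERSHIP.get(user, ()) and scope.startswith(assigned_scope):
            roles.add(role)
    return roles

assert roles_for("alice", "/mg-root/sub-prod/rg-app-a") == {"Contributor"}
assert roles_for("alice", "/mg-root/sub-prod/rg-other") == set()
assert roles_for("bob", "/mg-root/sub-prod/rg-other") == {"Owner"}
```

Because access flows through groups, onboarding and access reviews operate on group membership rather than thousands of individual assignments.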
Q: As a GCP Engineer, how would you structure projects and IAM for a team running multiple services?
Why they ask it: They’re testing whether you understand GCP’s resource hierarchy and IAM model.
Answer framework: “Org–Folder–Project” mapping plus separation of duties.
Example answer: “I map environments and major domains into folders and projects so billing, quotas, and IAM boundaries are clean. I keep prod projects separate from dev/test, and I use service accounts per workload with minimal permissions. Shared services like logging, monitoring, and networking often live in dedicated projects. I also standardize IAM via groups and use org policies to enforce constraints like no public buckets and required encryption.”
Common mistake: Putting everything into one project and trying to ‘manage’ it with naming conventions.
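Constraints like “no public buckets” work because GCP org policies inherit down the org → folder → project hierarchy. A toy resolution of that walk-up-the-tree behavior (the hierarchy, constraint name, and merge semantics are simplified; real org policy inheritance has richer merge/override rules):

```python
# Toy inheritance of an org-policy constraint down the GCP
# org -> folder -> project hierarchy. Names are illustrative,
# and real merge semantics are richer than this.

PARENT = {
    "proj-app-prod": "folder-prod", "folder-prod": "org",
    "proj-sandbox": "folder-dev",   "folder-dev": "org",
}
POLICY = {  # node -> {constraint: enforced}
    "org": {"storage.publicAccessPrevention": True},
    "folder-dev": {"storage.publicAccessPrevention": False},  # explicit override
}

def effective(node, constraint):
    # Walk up the tree until some ancestor sets the constraint explicitly.
    while node is not None:
        if constraint in POLICY.get(node, {}):
            return POLICY[node][constraint]
        node = PARENT.get(node)
    return False  # unset means not enforced (simplification)

assert effective("proj-app-prod", "storage.publicAccessPrevention") is True
assert effective("proj-sandbox", "storage.publicAccessPrevention") is False
```

This is why the project/folder structure matters: it is the unit at which policy, billing, and IAM boundaries actually apply.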
Q: Explain the difference between RTO and RPO, and how you design for them in cloud.
Why they ask it: They’re testing disaster recovery thinking, not just high availability buzzwords.
Answer framework: Define → map to architecture → validate with tests.
Example answer: “RTO is how fast you need to recover; RPO is how much data you can afford to lose. If RPO is near-zero, I’ll design synchronous replication or managed database features that support it, and I’ll focus on write-path durability. If RTO is tight, I’ll use warm standby or active-active patterns and automate failover. The key is aligning architecture cost with business requirements, then proving it with game days and restore tests.”
Common mistake: Saying “multi-region” as a universal answer without cost and complexity tradeoffs.
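The RPO conversation is ultimately arithmetic: the worst case is a failure just before the next snapshot lands, plus any lag copying that snapshot off-site. A back-of-envelope check with made-up numbers:

```python
# Back-of-envelope RPO check for a snapshot + replication scheme.
# Intervals and lags below are illustrative assumptions.

def worst_case_rpo(snapshot_interval_min, replication_lag_min=0):
    # Worst case: data written right after the last snapshot is lost,
    # and the last snapshot still had to replicate off-site.
    return snapshot_interval_min + replication_lag_min

# Hourly snapshots with 5 minutes of lag cannot meet a 15-minute RPO:
assert worst_case_rpo(60, 5) == 65
assert worst_case_rpo(60, 5) > 15

# Five-minute snapshots with 1 minute of lag can:
assert worst_case_rpo(5, 1) <= 15
```

Running this kind of check against the actual backup schedule, and then proving it with restore tests, is what separates a stated RPO from a real one.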
Q: What’s your approach to observability for cloud platforms—metrics, logs, traces?
Why they ask it: They’re testing whether you can debug distributed systems under pressure.
Answer framework: “Golden signals + correlation.” What you collect, how you connect it, what you alert on.
Example answer: “I start with golden signals—latency, traffic, errors, saturation—and define SLOs so alerts reflect user impact. Logs are structured and centralized with consistent fields like request IDs and tenant IDs. Traces are critical for microservices; I make sure propagation is standardized so we can follow a request end-to-end. I also tune alerts to avoid noise: paging is for SLO burn or hard failures, not every CPU spike.”
Common mistake: Confusing ‘more dashboards’ with observability, and paging on everything.
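The “alerts reflect user impact” point is usually implemented as error-budget burn-rate alerting. A minimal sketch, using the commonly cited 14.4x multi-window threshold; the SLO, window error ratios, and threshold choice here are assumptions for illustration:

```python
# Minimal multi-window burn-rate check in the spirit of SLO-based
# alerting. The 99.9% SLO and 14.4x threshold follow common guidance;
# the error ratios are invented for the example.

SLO = 0.999            # 99.9% success target
BUDGET = 1 - SLO       # 0.1% error budget

def burn_rate(error_ratio):
    return error_ratio / BUDGET

def should_page(short_window_errors, long_window_errors):
    # Page only when both a fast and a slow window burn hot,
    # which filters out brief spikes that self-resolve.
    return (burn_rate(short_window_errors) > 14.4
            and burn_rate(long_window_errors) > 14.4)

assert should_page(0.02, 0.016)       # sustained 2% errors vs 0.1% budget
assert not should_page(0.02, 0.0005)  # brief spike, long window healthy
```

The payoff is exactly the answer's closing line: paging maps to SLO burn, not to every CPU spike.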
Q: How do you secure secrets in CI/CD and runtime?
Why they ask it: They’re testing whether you prevent the most common cloud breach patterns.
Answer framework: “Store–Access–Rotate–Audit.”
Example answer: “I avoid storing secrets in repos or CI variables long-term. I use a managed secrets store and grant access via workload identity—like IAM roles, managed identities, or service accounts—so apps fetch secrets at runtime. Rotation is automated where possible, and I log access to detect anomalies. In CI/CD, I use short-lived credentials (OIDC federation) instead of static keys.”
Common mistake: Relying on long-lived access keys and hoping nobody leaks them.
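The advantage of short-lived credentials is that a leak has a built-in expiry. An in-memory sketch of that property (in practice this is OIDC federation to STS, managed identities, or similar, not hand-rolled tokens):

```python
# Sketch of short-lived credential issuance versus a static key.
# In-memory stand-in; real systems use OIDC federation / STS.

import time

def issue_token(subject, ttl_seconds=900):
    # A 15-minute token scoped to the CI job's identity.
    return {"sub": subject, "expires_at": time.time() + ttl_seconds}

def is_valid(token, now=None):
    return (now or time.time()) < token["expires_at"]

token = issue_token("ci:repo/main")
assert is_valid(token)

# A leaked token stops working on its own once the TTL passes:
assert not is_valid(token, now=time.time() + 3600)
```

Contrast this with a static access key, which stays valid until someone notices the leak and rotates it manually.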
Q: What US compliance or security standards have you worked with, and how did they affect your cloud design?
Why they ask it: They’re testing whether you can operate in regulated environments common in the US.
Answer framework: “Standard → control → implementation.”
Example answer: “I’ve worked in environments aligned to SOC 2 controls, and I’ve supported HIPAA-adjacent workloads. Practically, that meant enforcing encryption at rest and in transit, strong access controls with audit trails, and documented change management. We implemented centralized logging, retention policies, and least-privilege IAM with periodic reviews. The standard isn’t the work—the controls are, and I translate them into guardrails teams can live with.”
Common mistake: Name-dropping ‘SOC 2’ or ‘HIPAA’ without describing concrete technical controls.
Q: A deployment breaks connectivity because a network policy change went wrong. What do you do first?
Why they ask it: They’re testing incident triage and rollback instincts.
Answer framework: Triage–Stabilize–Diagnose–Fix–Prevent.
Example answer: “First I stabilize: stop the bleeding by pausing pipelines and rolling back the last known-good network change if possible. I confirm scope—what regions, what services, what paths are broken—using health checks and logs. Then I validate whether it’s routing, security groups/NSGs, NACLs, firewall rules, or DNS. Once service is restored, I do a post-incident review and add pre-deploy validation, like policy checks and staged rollouts, so a single change can’t take down the whole path again.”
Common mistake: Diving into root cause before restoring service or controlling blast radius.
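“Roll back the last known-good change” presumes the change history records what was healthy. A sketch of that triage-first lookup (the change-log format is invented for illustration):

```python
# Triage-first sketch: find the last change marked healthy and restore
# it before root-causing. Change-log shape is an assumption.

changes = [
    {"id": "chg-101", "desc": "baseline rules",       "healthy": True},
    {"id": "chg-102", "desc": "add egress allowlist", "healthy": True},
    {"id": "chg-103", "desc": "tighten NSG",          "healthy": False},  # broke connectivity
]

def last_known_good(change_log):
    # Walk backwards to the most recent change that passed health checks.
    for change in reversed(change_log):
        if change["healthy"]:
            return change
    return None

target = last_known_good(changes)
assert target["id"] == "chg-102"  # restore this first; diagnose chg-103 after
```

The prevention step in the answer, pre-deploy validation and staged rollouts, exists precisely so the `healthy` flag is known before a change reaches the whole fleet.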
Q: How do you handle container platform upgrades (like Kubernetes) without downtime?
Why they ask it: They’re testing whether you can run platform changes like a product.
Answer framework: “Plan–Stage–Migrate–Verify.”
Example answer: “I start by reading the deprecation notes and mapping impacted APIs and add-ons. I upgrade in a staging cluster first, then use blue/green or node pool rotation in prod so workloads drain gracefully. I validate with synthetic checks and SLO monitoring during the rollout. If something fails, I have a rollback path—either reverting node pools or shifting traffic back to the old cluster.”
Common mistake: Treating upgrades as a one-click operation and discovering breaking changes in production.
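Node pool rotation preserves capacity by surging a new node in before draining an old one out. A simulated sketch of that ordering (the scheduler, cordon, and drain are stand-ins; real clusters use kubectl cordon/drain semantics and pod disruption budgets):

```python
# Sketch of surge-style node pool rotation: add a new node, remove an
# old one, repeat, so capacity never drops below the original count.
# Node names and the version label are illustrative.

def rotate(old_nodes, new_version):
    nodes = list(old_nodes)
    capacity_history = []
    for old in old_nodes:
        nodes.append(f"{old}-{new_version}")  # surge: new node joins first
        nodes.remove(old)                     # then the old node drains away
        capacity_history.append(len(nodes))
    return nodes, capacity_history

nodes, capacity = rotate(["n1", "n2", "n3"], "v1.30")
assert nodes == ["n1-v1.30", "n2-v1.30", "n3-v1.30"]
assert min(capacity) >= 3  # capacity never dipped below the starting count
```

The rollback path mentioned in the answer is the same loop in reverse: keep the old node pool around until the new one has proven itself under real traffic.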