4) Technical and professional questions (what separates real DBAs)
This is the heart of the US DBA interview. Expect follow-ups. Expect “what if that doesn’t work?” And expect them to care about your reasoning more than your memorization.
Q: Explain your backup strategy and how you validate restores.
Why they ask it: Backups you can’t restore are theater; they want proof you’ve tested recovery.
Answer framework: Policy → Mechanics → Verification. Include RPO/RTO, retention, and restore testing cadence.
Example answer: “I design backups around RPO/RTO first—then pick the mechanics. For example, daily full plus frequent incrementals/log backups, encrypted at rest, with retention aligned to business and legal needs. The key is validation: I schedule automated restore tests to a non-prod environment, run integrity checks, and verify application-level sanity where possible. I also document the exact restore sequence so it’s not improvised during an incident.”
Common mistake: Listing backup types but never mentioning restore drills or verification.
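The "verification" half of that framework can be made concrete. A minimal sketch in SQL Server's T-SQL, where database names and paths are hypothetical examples: back up with checksums, then verify the file is actually readable before trusting the job's green checkmark.

```sql
-- Hypothetical example (SQL Server); database name and paths are placeholders.
-- CHECKSUM writes page checksums into the backup so corruption is detectable.
BACKUP DATABASE SalesDb
TO DISK = N'\\backupshare\SalesDb_full.bak'
WITH CHECKSUM, COMPRESSION, INIT;

-- VERIFYONLY reads the whole backup and validates checksums without restoring.
-- It catches unreadable files, though a real restore test remains the gold standard.
RESTORE VERIFYONLY
FROM DISK = N'\\backupshare\SalesDb_full.bak'
WITH CHECKSUM;
```

Scheduling the second statement as part of the backup job itself means "job succeeded" implies "file is readable," not just "command returned."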
Q: How do you design high availability and disaster recovery, and how do you choose between them?
Why they ask it: They want to see if you understand HA vs. DR and can map them to business requirements.
Answer framework: Requirements → Architecture → Failure modes. Speak in terms of RTO/RPO, region failure, and operational complexity.
Example answer: “I separate HA from DR: HA handles local failures with minimal downtime; DR handles site/region loss. I start with RTO/RPO targets and budget, then choose patterns like synchronous replication for HA and asynchronous cross-region replication for DR. I also plan for failover testing, DNS/connection string strategy, and what happens to writes during split-brain scenarios. The best design is the one you can actually operate at 2 AM.”
Common mistake: Saying ‘we have replication’ as if replication automatically equals DR.
Q: You’re interviewing for a SQL DBA role. How do you troubleshoot a sudden spike in query latency on SQL Server?
Why they ask it: They’re testing whether you can triage with the right DMVs and isolate root cause fast.
Answer framework: A triage ladder: “Is it waits? Is it blocking? Is it IO? Is it a plan regression?”
Example answer: “First I check whether it’s system-wide or isolated—CPU, IO latency, memory pressure, and wait stats. Then I look for blocking chains and long-running transactions, and I identify top queries by duration and reads. If it’s a plan regression, I compare recent plan changes and parameter sniffing behavior, and I’ll consider forcing a known-good plan or updating stats as a controlled fix. I keep changes reversible and I narrate what I’m doing to the incident channel.”
Common mistake: Jumping straight to ‘add indexes’ without proving the bottleneck.
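The triage ladder above maps directly onto a few DMV queries. A hedged sketch, run top to bottom during an incident; thresholds and the benign-wait filter are illustrative, not exhaustive:

```sql
-- 1) What is the server waiting on? (cumulative since restart; compare to a baseline)
SELECT TOP (10) wait_type, waiting_tasks_count, wait_time_ms
FROM sys.dm_os_wait_stats
ORDER BY wait_time_ms DESC;

-- 2) Is there a blocking chain right now?
SELECT session_id, blocking_session_id, wait_type, wait_time, command
FROM sys.dm_exec_requests
WHERE blocking_session_id <> 0;

-- 3) Which queries dominate duration and reads (candidate plan regressions)?
SELECT TOP (10)
       qs.total_elapsed_time / qs.execution_count  AS avg_elapsed_us,
       qs.total_logical_reads / qs.execution_count AS avg_reads,
       SUBSTRING(st.text, 1, 200)                  AS query_text
FROM sys.dm_exec_query_stats AS qs
CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) AS st
ORDER BY avg_elapsed_us DESC;
```

Being able to name these DMVs, and explain what each step rules out, is exactly the kind of evidence interviewers probe for with follow-ups.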
Q: For an Oracle DBA position, how do you approach performance tuning without causing instability?
Why they ask it: Oracle tuning can be powerful—and dangerous—if you change too much at once.
Answer framework: Baseline → Hypothesis → Minimal change → Measure. Mention AWR/ASH and safe rollout.
Example answer: “I start with a baseline from AWR/ASH to see where time is spent—CPU, IO, contention, parsing. I focus on the biggest contributor first and make one controlled change at a time, like a targeted index, SQL profile, or stats strategy. I validate in a staging environment when possible and schedule production changes with rollback steps. The goal is measurable improvement without introducing new variance.”
Common mistake: Treating tuning as a bag of tricks instead of a disciplined experiment.
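The "baseline from AWR" step can be shown concretely. A sketch assuming Oracle 12c or later; the snapshot IDs are hypothetical placeholders you would read off the first query:

```sql
-- List recent AWR snapshots to pick a before/after window.
SELECT snap_id, begin_interval_time
FROM   dba_hist_snapshot
ORDER  BY snap_id DESC
FETCH FIRST 10 ROWS ONLY;

-- Generate the text AWR report between two snapshots (IDs are examples).
SELECT output
FROM   TABLE(dbms_workload_repository.awr_report_text(
               l_dbid     => (SELECT dbid FROM v$database),
               l_inst_num => 1,
               l_bid      => 12345,
               l_eid      => 12346));
```

The discipline is in the workflow, not the report: capture the baseline, change one thing, take new snapshots, and compare the same sections.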
Q: How do you handle index strategy and maintenance for large OLTP tables?
Why they ask it: They want to see if you understand write amplification, fragmentation, and workload patterns.
Answer framework: Workload-first: “queries → access paths → index cost.” Then maintenance policy.
Example answer: “I start from the workload: top queries, join patterns, and predicates. I add indexes to support critical paths, but I’m careful about over-indexing because every write pays the bill. For maintenance, I use engine-appropriate strategies—rebuild/reorganize thresholds, fill factor choices, and stats updates—based on fragmentation and plan stability. I also watch for unused indexes and remove them with evidence.”
Common mistake: Saying ‘index everything used in WHERE’—that’s how you kill write performance.
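"Remove them with evidence" deserves a concrete shape. A hedged SQL Server sketch for gathering that evidence; the page-count threshold is an example cutoff, and usage stats reset on restart, so judge them over a long enough window:

```sql
-- Indexes that take writes but have had no reads since the last restart.
SELECT OBJECT_NAME(s.object_id) AS table_name,
       i.name                   AS index_name,
       s.user_updates           AS writes
FROM sys.dm_db_index_usage_stats AS s
JOIN sys.indexes AS i
  ON i.object_id = s.object_id AND i.index_id = s.index_id
WHERE s.database_id = DB_ID()
  AND s.user_seeks + s.user_scans + s.user_lookups = 0
ORDER BY s.user_updates DESC;

-- Fragmentation, to drive rebuild vs. reorganize decisions.
SELECT OBJECT_NAME(ps.object_id) AS table_name,
       ps.avg_fragmentation_in_percent,
       ps.page_count
FROM sys.dm_db_index_physical_stats(DB_ID(), NULL, NULL, NULL, 'LIMITED') AS ps
WHERE ps.page_count > 1000  -- ignore tiny indexes where fragmentation is noise
ORDER BY ps.avg_fragmentation_in_percent DESC;
```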
Q: How do you manage database security: least privilege, secrets, and auditing?
Why they ask it: In the US, security teams will expect you to speak their language, not just “we have passwords.”
Answer framework: Controls stack: Identity → Authorization → Encryption → Audit.
Example answer: “I prefer role-based access with least privilege and separation of duties—no shared admin accounts. Secrets live in a vault, rotated, and never hardcoded in app configs. I enforce encryption in transit and at rest, and I enable auditing for privileged actions and sensitive tables, with logs shipped to a central system. When possible, I use short-lived credentials and MFA for admin access.”
Common mistake: Treating security as only ‘network firewall’ and ignoring identity, roles, and audit trails.
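The role-based, least-privilege pattern is easy to demonstrate in a few lines. A minimal SQL Server-flavored sketch; role, schema, and user names are hypothetical examples:

```sql
-- Application role: DML only, scoped to one schema, no DDL or admin rights.
CREATE ROLE app_readwrite;
GRANT SELECT, INSERT, UPDATE, DELETE ON SCHEMA::sales TO app_readwrite;

-- Reporting role: read-only on the same schema.
CREATE ROLE reporting_readonly;
GRANT SELECT ON SCHEMA::sales TO reporting_readonly;

-- Principals get membership in a role, never direct object grants,
-- so access reviews only have to audit role membership.
ALTER ROLE app_readwrite ADD MEMBER order_service;
ALTER ROLE reporting_readonly ADD MEMBER analyst_jane;
```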
Q: What US regulations or standards have influenced how you run databases?
Why they ask it: They’re checking whether you can operate under compliance constraints common in US industries.
Answer framework: Standard → Control → Implementation. Pick what’s relevant (SOC 2, HIPAA, PCI DSS) and describe concrete controls.
Example answer: “I’ve worked in environments aligned with SOC 2 controls, where change management, access reviews, and audit logging are non-negotiable. Practically, that meant ticketed approvals for production changes, quarterly access recertification, and immutable audit logs for privileged actions. In healthcare contexts, HIPAA pushed us to tighten PHI access and encryption, plus stricter monitoring for anomalous reads. I’m comfortable translating compliance requirements into technical controls without turning delivery into molasses.”
Common mistake: Name-dropping HIPAA/PCI/SOC 2 without explaining what you actually did differently.
Q: How do you approach schema migrations in a zero-downtime environment?
Why they ask it: They want to know if you can protect availability while the product keeps shipping.
Answer framework: Expand–Migrate–Contract (a.k.a. parallel change). Emphasize backward compatibility.
Example answer: “I use an expand–migrate–contract approach: add new columns/tables in a backward-compatible way, deploy code that writes both or reads from the new path, backfill data in batches, then remove old structures once traffic is fully migrated. I pay attention to locks, long transactions, and replication lag. If the change is risky, I insist on feature flags and a rollback plan.”
Common mistake: Planning a ‘big bang’ migration on a large table during peak hours.
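The expand-migrate-contract sequence can be sketched as three DDL/DML phases. Table and column names are hypothetical, and the batched-update syntax shown is T-SQL; the exact loop construct varies by engine:

```sql
-- EXPAND: add the new column as nullable so existing code keeps working.
ALTER TABLE customers ADD email_normalized VARCHAR(320) NULL;

-- MIGRATE: backfill in small batches to bound locks, log growth, and
-- replication lag. Repeat until zero rows are affected.
UPDATE TOP (5000) customers
SET email_normalized = LOWER(email)
WHERE email_normalized IS NULL;

-- CONTRACT: only after every reader and writer uses the new column.
ALTER TABLE customers DROP COLUMN email;
```

Each phase ships as its own deployment, so any one of them can be rolled back without touching the others.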
Q: What monitoring signals do you consider non-negotiable for production databases?
Why they ask it: They want to see if you can detect failure early and reduce MTTR.
Answer framework: Golden signals + database-specific signals. Tie each signal to an action.
Example answer: “I always want visibility into latency, error rate, saturation, and throughput—but mapped to database realities: replication lag, lock waits, connection pool saturation, IO latency, buffer cache hit ratio (with context), and slow query volume. Alerts should be actionable, not noisy, so I tune thresholds and include runbook links. The goal is to catch degradation before customers do.”
Common mistake: Listing dozens of metrics without explaining which ones trigger which response.
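"Tie each signal to an action" can be grounded with queries a monitor polls. A hedged SQL Server sketch; the lag column assumes Always On availability groups on SQL Server 2016 or later, and thresholds are examples:

```sql
-- Replica lag in seconds: alert above your RPO, page well before it.
SELECT ar.replica_server_name,
       drs.secondary_lag_seconds
FROM sys.dm_hadr_database_replica_states AS drs
JOIN sys.availability_replicas AS ar
  ON ar.replica_id = drs.replica_id;

-- Blocked sessions: a nonzero count that persists for minutes is actionable.
SELECT COUNT(*) AS blocked_sessions
FROM sys.dm_exec_requests
WHERE blocking_session_id <> 0;
```

Each number maps to a runbook step, which is the difference between a metric and a signal.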
Q: A tooling failure: your backup job reports success, but the backup file is corrupted. What do you do?
Why they ask it: They’re testing whether you can handle “silent failure,” the scariest kind.
Answer framework: Contain → Verify → Restore path → Prevent. Speak calmly and sequentially.
Example answer: “First I assume we’re exposed and I communicate that risk to the right channel. I verify the extent: which backups are affected, for how long, and whether replicas or snapshots are intact. Then I immediately run a restore test from the most recent known-good backup and confirm data integrity. Finally, I fix the pipeline—checksum validation, storage health checks, and alerting on restore-test failures—so ‘success’ means ‘restorable.’”
Common mistake: Treating it as a one-off and not adding restore validation to prevent recurrence.
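The "restore test plus integrity check" step looks like this in practice. A hedged SQL Server sketch; database names and paths are placeholder examples for a scratch restore on a non-prod server:

```sql
-- Quick checks first: does the header parse, and do checksums validate?
RESTORE HEADERONLY FROM DISK = N'\\backupshare\SalesDb_full.bak';
RESTORE VERIFYONLY FROM DISK = N'\\backupshare\SalesDb_full.bak' WITH CHECKSUM;

-- Full check: restore to a scratch database and verify logical integrity.
RESTORE DATABASE SalesDb_verify
FROM DISK = N'\\backupshare\SalesDb_full.bak'
WITH MOVE N'SalesDb'     TO N'D:\verify\SalesDb.mdf',
     MOVE N'SalesDb_log' TO N'D:\verify\SalesDb_log.ldf',
     REPLACE;
DBCC CHECKDB (N'SalesDb_verify') WITH NO_INFOMSGS;
```

Wiring an alert to any failure in this sequence is how "success means restorable" stops being a slogan.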