4) Technical and professional questions (the real filter)
This is where US interviewers separate “I’ve used Snowflake” from “I can design a platform that won’t collapse at 10x volume.” Expect follow-ups. If you give a generic answer, they’ll keep drilling until the gaps show.
Q: Design a modern analytics platform for product events and operational data. What architecture would you choose and why?
Why they ask it: They’re testing end-to-end thinking: ingestion, storage, modeling, governance, and consumption.
Answer framework: “Layered architecture walkthrough” (sources → ingestion → storage → transform → serve → govern) with explicit tradeoffs.
Example answer: “I’d start with a landing zone for raw data—immutable, partitioned, and versioned—then a curated layer with standardized schemas and quality checks, and finally purpose-built marts for BI and ML. For ingestion, I’d mix CDC for operational databases and streaming for events, with clear SLAs per domain. I’d choose a lakehouse-style approach if we need both BI and ML on shared data, but I’d still enforce contracts and ownership so the lake doesn’t become a swamp. The key is governance baked into CI/CD: catalog, lineage, tests, and access controls.”
Common mistake: Jumping straight to vendor names without explaining layers, SLAs, and ownership.
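The "contracts and ownership" point above can be made concrete. Below is a minimal sketch, in Python, of a per-dataset contract record that a catalog or CI check could read; the field names (layer, owner, freshness SLA) are illustrative assumptions, not any specific tool's schema.

```python
from dataclasses import dataclass

# Hypothetical dataset contract: one record per table, readable by CI.
# Values for `layer` and the SLA field are illustrative assumptions.
@dataclass(frozen=True)
class DatasetContract:
    name: str                  # e.g. "curated.orders"
    layer: str                 # "landing" | "curated" | "mart"
    owner: str                 # accountable team, so the lake has ownership
    freshness_sla_minutes: int
    required_columns: tuple    # schema contract enforced in CI

    def validate_columns(self, actual_columns) -> list:
        """Return required columns missing from the actual schema."""
        return [c for c in self.required_columns if c not in actual_columns]

orders = DatasetContract(
    name="curated.orders",
    layer="curated",
    owner="commerce-data",
    freshness_sla_minutes=60,
    required_columns=("order_id", "customer_id", "order_ts", "amount"),
)
missing = orders.validate_columns(["order_id", "customer_id", "amount"])
```

In an interview, the point is less the code and more the idea: every dataset has a named owner, a layer, an SLA, and a schema contract that a pipeline can check automatically.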
Q: How do you decide between a star schema, Data Vault, and a wide-table approach for analytics?
Why they ask it: They want to see modeling judgment, not ideology.
Answer framework: “Context matrix.” Compare by change rate, audit needs, consumer skill level, and query patterns.
Example answer: “If the business needs stable BI with conformed dimensions, I lean star schema because it’s understandable and performant. If sources are volatile and auditability/history are critical—like finance or complex integrations—Data Vault can be a good backbone, but I’d still publish marts on top for usability. Wide tables can work for a narrow set of high-value dashboards, but I treat them as serving artifacts, not the core model, because they tend to duplicate logic and hide grain issues.”
Common mistake: Declaring one model ‘best’ without tying it to requirements.
Q: Walk me through how you handle slowly changing dimensions (SCD) and historical correctness.
Why they ask it: They’re testing whether you can support finance-grade reporting and reproducibility.
Answer framework: “Grain → keys → change capture → query impact.”
Example answer: “I start by defining the grain and natural keys, then choose SCD type based on use case—Type 2 for historical analysis, Type 1 for corrections, and sometimes hybrid. I prefer explicit effective_start/effective_end timestamps and a current_flag, plus surrogate keys for joins. I also document which facts should join ‘as of’ event time versus current state, because that’s where many teams accidentally rewrite history.”
Common mistake: Treating SCD as a purely technical pattern and ignoring business semantics.
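The Type 2 mechanics above can be sketched in a few lines. This is a minimal illustration using plain dicts; the column names (effective_start, effective_end, current_flag, surrogate key) follow the convention described in the answer and are assumptions, not a specific warehouse's DDL.

```python
from datetime import datetime

# Sentinel "open" end date for the current row version.
HIGH_DATE = datetime(9999, 12, 31)

def apply_scd2(dim_rows, natural_key, change, change_ts, next_sk):
    """Close the current row for the natural key, then insert a new version."""
    for row in dim_rows:
        if row[natural_key] == change[natural_key] and row["current_flag"]:
            row["effective_end"] = change_ts   # close out the old version
            row["current_flag"] = False
    dim_rows.append({
        **change,
        "surrogate_key": next_sk,              # surrogate key for fact joins
        "effective_start": change_ts,
        "effective_end": HIGH_DATE,
        "current_flag": True,
    })
    return dim_rows

dim = [{"customer_id": 1, "tier": "silver", "surrogate_key": 10,
        "effective_start": datetime(2023, 1, 1),
        "effective_end": HIGH_DATE, "current_flag": True}]
dim = apply_scd2(dim, "customer_id",
                 {"customer_id": 1, "tier": "gold"},
                 datetime(2024, 6, 1), next_sk=11)
```

An "as of" query then filters on `effective_start <= event_ts < effective_end`, which is exactly the join semantics the answer says teams get wrong when they join facts to current state.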
Q: What’s your approach to data quality—specifically, what do you test and where do you enforce it?
Why they ask it: They want to know if you can prevent silent failures.
Answer framework: “Quality pyramid” (schema/contract, freshness, volume, validity, reconciliation) + automation.
Example answer: “I enforce schema contracts at ingestion where possible, then add freshness and volume anomaly checks at the pipeline level. In curated models, I test uniqueness, not-null, referential integrity, and accepted values for key dimensions. For critical metrics, I add reconciliation checks against source systems or financial statements. The point is to catch issues early and make failures loud—alerts, runbooks, and ownership.”
Common mistake: Only mentioning ‘unit tests’ without covering freshness, volume anomalies, and reconciliation.
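The quality pyramid above is easy to demonstrate with three tiny checks: freshness, volume anomaly, and uniqueness. This is a hedged sketch; the thresholds are assumptions a real team would tune per dataset.

```python
from datetime import datetime, timedelta

def check_freshness(last_loaded_at, max_age):
    """Freshness: has the dataset been loaded within its SLA window?"""
    return datetime.utcnow() - last_loaded_at <= max_age

def check_volume(row_count, trailing_counts, tolerance=0.5):
    """Volume anomaly: flag loads deviating more than `tolerance`
    from the trailing mean (threshold is an illustrative assumption)."""
    if not trailing_counts:
        return True
    mean = sum(trailing_counts) / len(trailing_counts)
    return abs(row_count - mean) <= tolerance * mean

def check_unique(rows, key):
    """Uniqueness: no duplicate values for a key column."""
    keys = [r[key] for r in rows]
    return len(keys) == len(set(keys))
```

In practice these would live in the orchestrator or a testing framework and page an owner on failure; the interview point is that each layer of the pyramid gets its own check, not just model-level unit tests.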
Q: How would you design access control for PII in a US environment?
Why they ask it: They’re checking security maturity and whether you understand US compliance realities.
Answer framework: “Classify → control → audit.” Tie controls to data classification and least privilege.
Example answer: “First I classify data—PII, SPI, financial, public—and tag it in the catalog. Then I implement role-based access with least privilege, using column/row-level security where needed and masking for non-prod. I also separate duties: ingestion roles, transformation roles, and consumer roles. Finally, I ensure audit logs are retained and reviewed, and I align with relevant requirements like SOC 2 controls and, depending on the business, HIPAA or GLBA.”
Common mistake: Saying ‘we just restrict the schema’—without masking, auditing, or classification.
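The "masking for non-prod" point can be illustrated with deterministic masking: PII columns are replaced with a salted hash so joins still work but raw values never leave prod. This is a simplified sketch; which columns count as PII would come from the catalog tags mentioned above, and the salt would live in a secret store.

```python
import hashlib

# Illustrative classification; in practice this comes from catalog tags.
PII_COLUMNS = {"email", "ssn", "phone"}

def mask_value(value, salt="non-prod-salt"):  # salt is a placeholder assumption
    """Deterministic mask: same input -> same token, so joins survive."""
    return hashlib.sha256((salt + str(value)).encode()).hexdigest()[:16]

def mask_row(row, pii_columns=PII_COLUMNS):
    """Mask only classified columns; leave non-PII values untouched."""
    return {k: (mask_value(v) if k in pii_columns else v) for k, v in row.items()}

masked = mask_row({"customer_id": 42, "email": "a@example.com"})
```

Determinism is the design choice worth calling out: hashing (rather than random tokens) keeps referential integrity across masked tables, which is what makes non-prod data actually usable.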
Q: What US standards or frameworks have you used to guide controls and audits (SOC 2, NIST, HIPAA, etc.)?
Why they ask it: They want to see if you can operate in regulated environments and speak to auditors.
Answer framework: “Name → apply → evidence.” Mention a framework, how it affected design, and what evidence you produced.
Example answer: “I’ve worked in a SOC 2 environment where we had to demonstrate access controls, change management, and monitoring. That influenced how we set up IAM roles, approvals for production changes, and logging for data access. We kept evidence through ticketing, CI/CD logs, and periodic access reviews. I’m also familiar with NIST concepts around risk-based controls, even when the company isn’t formally certified.”
Common mistake: Dropping acronyms without explaining what you actually implemented.
Q: You’re interviewing as a Cloud Data Architect. How do you control warehouse/lakehouse cost without killing performance?
Why they ask it: In the US, cost overruns are a fast way to lose trust.
Answer framework: “Cost levers” (workload isolation, scaling policy, storage layout, query governance) + measurement.
Example answer: “I isolate workloads—ELT, BI, ad-hoc, and ML—so one noisy group doesn’t spike costs for everyone. I use autoscaling with guardrails, plus query tagging and chargeback/showback so teams see their spend. On the data side, I optimize partitioning/clustering and avoid repeated full scans by using incremental models and materializations. And I set governance: approved marts for common metrics, and limits for runaway ad-hoc queries.”
Common mistake: Only saying ‘turn on autosuspend’—that’s table stakes.
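The query-tagging and showback idea above reduces to a simple roll-up of spend by team tag. The record shape here (team, credits) is an assumption for illustration; real numbers would come from the warehouse's query history.

```python
from collections import defaultdict

def showback(query_log):
    """Roll up query spend by team tag; surface untagged queries
    explicitly so they get an owner instead of hiding in the total."""
    spend = defaultdict(float)
    for q in query_log:
        spend[q.get("team", "untagged")] += q["credits"]
    return dict(spend)

report = showback([
    {"team": "bi", "credits": 2.0},
    {"team": "elt", "credits": 5.5},
    {"credits": 1.0},  # untagged ad-hoc query
])
```

The "untagged" bucket is the useful part in practice: it turns missing governance into a visible line item rather than invisible spend.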
Q: What tools have you used for orchestration and transformation (Airflow, dbt, Spark), and how do you decide?
Why they ask it: They’re mapping your experience to their stack and checking architectural reasoning.
Answer framework: “Workload fit.” Choose based on complexity, team skill, and operational burden.
Example answer: “I’ve used Airflow for complex dependency orchestration and operational workflows, and dbt for SQL-first transformations with strong testing and lineage. For heavy processing or streaming, I’ve used Spark where it’s justified by scale or non-SQL transformations. My decision rule is: keep it simple for the team—dbt for most analytics transforms, add Spark only when SQL can’t do it efficiently, and keep orchestration observable with clear SLAs.”
Common mistake: Treating tools as identity—‘I’m an Airflow person’—instead of fitting them to the problem.
Q: How do you handle schema evolution for event data (mobile/web tracking) without breaking downstream models?
Why they ask it: This is a real-world pain point; experienced data architects have scars here.
Answer framework: “Contract + versioning + quarantine.”
Example answer: “I define an event contract with required fields and a version, and I validate events at ingestion. New fields are allowed, but breaking changes require a new version and a migration plan. I also keep a quarantine path for malformed events so pipelines don’t silently drop data. Downstream, I model events with a stable core and optional attributes, and I monitor field-level null spikes to catch instrumentation regressions.”
Common mistake: Relying on ‘JSON is flexible’ and ignoring downstream breakage.
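The "contract + versioning + quarantine" pattern can be sketched as a small router: each contract version declares required fields, unknown extra fields pass through, and events missing required fields go to quarantine instead of being dropped. The contract contents below are illustrative, not a real tracking plan.

```python
# Required fields per contract version (illustrative assumption).
CONTRACTS = {
    1: {"event_name", "user_id", "ts"},
    2: {"event_name", "user_id", "ts", "session_id"},
}

def route_event(event, valid, quarantine):
    """Validate against the event's declared contract version;
    malformed events are kept for replay, never silently dropped."""
    required = CONTRACTS.get(event.get("schema_version"))
    if required is None or not required.issubset(event):
        quarantine.append(event)
    else:
        valid.append(event)

valid, quarantine = [], []
route_event({"schema_version": 1, "event_name": "click",
             "user_id": "u1", "ts": 1700000000, "extra": "ok"}, valid, quarantine)
route_event({"schema_version": 2, "event_name": "click",
             "user_id": "u1", "ts": 1700000000}, valid, quarantine)  # no session_id
```

Note the asymmetry the answer describes: additive fields (`extra`) are fine, but a missing required field under the declared version is a contract violation and gets quarantined for inspection and replay.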
Q: Tell me about a time you migrated from on-prem to cloud, or from one warehouse to another. What did you do first?
Why they ask it: They want sequencing and risk management, not hero stories.
Answer framework: “Discover → parallel run → cutover.”
Example answer: “I started with inventory and lineage: what datasets exist, who uses them, and what SLAs matter. Then we built the target platform and ran critical pipelines in parallel with reconciliation checks. We migrated consumers in waves—starting with low-risk dashboards—while keeping a rollback plan. The final cutover happened only after we hit agreed accuracy thresholds and performance baselines.”
Common mistake: Talking only about copying data—ignoring consumers, reconciliation, and rollback.
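The parallel-run reconciliation step above can be sketched as a count-plus-checksum comparison between source and target extracts. The keying and hashing scheme here is an assumption for illustration; the XOR makes the checksum order-insensitive, since the two systems rarely return rows in the same order.

```python
import hashlib

def table_fingerprint(rows):
    """Row count plus an order-insensitive checksum over canonicalized rows."""
    checksum = 0
    for row in rows:
        canon = "|".join(f"{k}={row[k]}" for k in sorted(row))
        checksum ^= int(hashlib.sha256(canon.encode()).hexdigest(), 16)
    return len(rows), checksum

def reconciled(source_rows, target_rows):
    """True when both extracts agree on count and content."""
    return table_fingerprint(source_rows) == table_fingerprint(target_rows)

src = [{"id": 1, "amt": 10}, {"id": 2, "amt": 20}]
tgt = [{"id": 2, "amt": 20}, {"id": 1, "amt": 10}]  # same data, different order
```

In a real migration this runs per table per load, and cutover waits until the fingerprints match across the agreed window, which is the "accuracy threshold" the answer refers to.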
Q: What would you do if the data platform fails during a critical reporting window (board deck, month-end close)?
Why they ask it: They’re testing incident leadership and operational maturity.
Answer framework: “Triage → stabilize → communicate → prevent.”
Example answer: “First I’d establish impact and scope—what pipelines, what datasets, what stakeholders. Then I’d stabilize: pause downstream jobs to prevent bad data propagation, switch to a known-good snapshot if available, and restore the minimal set needed for the reporting deadline. I’d communicate clearly in a single channel with ETAs and workarounds. Afterward, I’d run a blameless postmortem and implement prevention—better alerting, backfills, and runbooks.”
Common mistake: Going straight into technical debugging without mentioning stakeholder comms and containment.