4) Technical and professional questions (what separates the prepared from the lucky)
Technical rounds for a Data Engineer in the United States are usually practical. They want to see you reason about data modeling, distributed systems, and cloud economics—then communicate it like you’re on-call for it.
Q: Walk me through how you’d design a lakehouse-style architecture for analytics and ML.
Why they ask it: They’re testing system design: zones, governance, performance, and team workflows.
Answer framework: “Layers + Contracts”: raw/bronze → cleaned/silver → curated/gold, with explicit contracts and ownership.
Example answer: “I’d separate storage and compute, land immutable raw data with lineage, then build standardized cleaned datasets with schema enforcement and PII handling. Curated ‘gold’ models are consumer-facing with documented definitions and tests. I’d enforce contracts at ingestion, version schemas, and keep transformations in code review with CI. For performance, I’d partition by access patterns and use incremental processing wherever possible.”
Common mistake: Drawing boxes without explaining governance, ownership, and how changes propagate safely.
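The contract-enforcement idea behind the bronze-to-silver step can be sketched in a few lines of Python. The contract format, field names, and PII-masking rule below are illustrative assumptions, not any specific framework's API:

```python
# Minimal sketch of contract enforcement between raw (bronze) and cleaned
# (silver) layers. The contract shape and field names are illustrative.

CONTRACT = {
    "user_id": {"type": int, "required": True},
    "email": {"type": str, "required": True, "pii": True},
    "plan": {"type": str, "required": False},
}

def validate(record: dict) -> list[str]:
    """Return a list of contract violations for one raw record."""
    errors = []
    for field, rules in CONTRACT.items():
        if field not in record or record[field] is None:
            if rules["required"]:
                errors.append(f"missing required field: {field}")
            continue
        if not isinstance(record[field], rules["type"]):
            errors.append(f"bad type for {field}: {type(record[field]).__name__}")
    return errors

def to_silver(raw_records: list[dict]) -> tuple[list[dict], list[dict]]:
    """Split raw records into clean rows (PII masked) and rejects."""
    clean, rejects = [], []
    for rec in raw_records:
        errors = validate(rec)
        if errors:
            rejects.append({"record": rec, "errors": errors})
            continue
        cleaned = {k: v for k, v in rec.items() if k in CONTRACT}
        # Mask PII fields before they leave the raw zone.
        for field, rules in CONTRACT.items():
            if rules.get("pii") and field in cleaned:
                cleaned[field] = "***"
        clean.append(cleaned)
    return clean, rejects
```

The point to land in an interview is that rejects are kept and inspected rather than silently dropped, so contract violations become a signal instead of data loss.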
Q: In SQL, how would you deduplicate events when you can get late-arriving data and retries?
Why they ask it: They’re testing whether you understand idempotency and event-time realities.
Answer framework: “Key + Window + Watermark”: define a stable dedupe key, use window functions, handle lateness.
Example answer: “I’d define a deterministic event_id or a composite key like user_id + event_type + event_timestamp + source_sequence. Then I’d use ROW_NUMBER() over that key ordered by ingestion_time desc and keep rn=1. For late arrivals, I’d process by event_time with a watermark and allow updates within a defined lookback window. I’d also track duplicates as a metric—if it spikes, something upstream changed.”
Common mistake: Using DISTINCT and calling it done. On full rows it hides the upstream problem; on a subset of columns it can silently drop legitimate records.
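The ROW_NUMBER() pattern from the answer can be demonstrated end to end. This sketch runs the SQL against SQLite (via the stdlib sqlite3 module) purely for illustration; the column names such as ingestion_time are assumptions from the example answer:

```python
import sqlite3

# The ROW_NUMBER() dedup pattern, run against SQLite for illustration.
# Column names (user_id, event_type, event_timestamp, ingestion_time)
# follow the composite key described above and are assumptions.
DEDUP_SQL = """
SELECT user_id, event_type, event_timestamp, payload
FROM (
    SELECT *,
           ROW_NUMBER() OVER (
               PARTITION BY user_id, event_type, event_timestamp
               ORDER BY ingestion_time DESC
           ) AS rn
    FROM events
)
WHERE rn = 1
"""

def dedup(rows):
    """rows: (user_id, event_type, event_timestamp, ingestion_time, payload)."""
    con = sqlite3.connect(":memory:")
    con.execute(
        "CREATE TABLE events (user_id, event_type, event_timestamp,"
        " ingestion_time, payload)"
    )
    con.executemany("INSERT INTO events VALUES (?, ?, ?, ?, ?)", rows)
    return con.execute(DEDUP_SQL).fetchall()
```

A retry with the same business key but a later ingestion_time wins, which is exactly the idempotency property interviewers are probing for.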
Q: Explain how you tune Spark jobs for performance and cost.
Why they ask it: They’re testing distributed systems intuition, not memorized configs.
Answer framework: “Diagnose → Fix → Validate”: read the Spark UI, identify skew/shuffle, apply targeted changes, measure.
Example answer: “I start with the Spark UI to see where time is spent—shuffles, GC, skewed tasks. Common fixes are reducing shuffle with broadcast joins when appropriate, repartitioning on join keys, and avoiding UDFs in hot paths. I’ll also tune file sizes and partition counts to match cluster resources. Then I validate with before/after metrics: runtime, cost per run, and output correctness.”
Common mistake: Randomly changing executor memory/cores without evidence from the UI.
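The "diagnose before fixing" step can be made concrete. In Spark you would read task-level input sizes off the stage metrics in the UI; this framework-agnostic sketch just quantifies skew from a list of partition sizes, and the 5x-median threshold is an illustrative starting point, not a rule:

```python
from statistics import median

def skew_report(partition_sizes: list[int], threshold: float = 5.0) -> dict:
    """Flag partitions whose size is >= `threshold` times the median.

    In Spark you'd get these numbers from the stage's task metrics in
    the UI; the 5x default is an assumption for illustration.
    """
    med = median(partition_sizes) or 1
    skewed = [
        (i, size) for i, size in enumerate(partition_sizes)
        if size / med >= threshold
    ]
    return {
        "median": med,
        "max": max(partition_sizes),
        "skewed_partitions": skewed,
    }
```

If the report shows one hot partition, that points at a skewed join or partition key, which suggests salting or a broadcast join rather than blindly raising executor memory.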
Q: When would you choose an ELT approach vs. ETL, and what changes operationally?
Why they ask it: They’re testing modern warehouse patterns and how you think about ownership.
Answer framework: “Where compute lives + who debugs”: decide based on scale, governance, and team skill.
Example answer: “I prefer ELT when the warehouse can handle transformations efficiently and we want analysts to contribute via versioned models. ETL makes sense when transformations must happen before landing—for example, heavy parsing, encryption, or strict data minimization. Operationally, ELT shifts more logic into warehouse jobs and requires strong modeling discipline, testing, and cost controls. ETL often means more custom code and more moving parts in the pipeline layer.”
Common mistake: Treating ELT as ‘no engineering needed’ and ignoring testing and cost.
Q: How do you implement data quality checks that actually catch issues without spamming alerts?
Why they ask it: They’re testing whether you can balance sensitivity with signal.
Answer framework: “Critical metrics first”: start with a small set of high-value checks, then expand.
Example answer: “I start with checks tied to business impact: primary key uniqueness, referential integrity for core dimensions, freshness SLAs, and anomaly detection on key measures. I set thresholds based on historical distributions, not vibes, and I route alerts by severity. I also track false positives and tune thresholds over time. The goal is that when an alert fires, people believe it.”
Common mistake: Adding dozens of checks on day one and training everyone to ignore alerts.
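"Thresholds based on historical distributions, not vibes" can be sketched as a z-score check with severity routing. The cutoffs below are illustrative assumptions, and a real deployment would tune them against tracked false positives as described above:

```python
from statistics import mean, stdev

def anomaly_alert(history: list[float], today: float, z_threshold: float = 3.0):
    """Return an alert severity based on how far `today` sits from history.

    Thresholds come from the historical distribution rather than fixed
    magic numbers; the z-score cutoffs here are illustrative.
    """
    if len(history) < 2:
        return None  # not enough history to set a threshold yet
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        return "critical" if today != mu else None
    z = abs(today - mu) / sigma
    if z >= z_threshold:
        return "critical"   # page someone
    if z >= z_threshold / 1.5:
        return "warning"    # route to a channel, don't page
    return None
```

Routing by severity is the piece that keeps alert fatigue down: only "critical" pages a human, while "warning" accumulates in a channel for review.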
Q: What’s your strategy for schema evolution in streaming pipelines?
Why they ask it: They’re testing whether you can keep pipelines stable as producers change.
Answer framework: “Contract + Compatibility”: define schema ownership, enforce compatibility rules, version changes.
Example answer: “I prefer a schema registry with compatibility rules—backward compatible changes by default. Producers version schemas and can add optional fields without breaking consumers. On the consumer side, I parse defensively and keep raw payloads for reprocessing. For breaking changes, I run parallel topics/streams and migrate consumers with a clear deprecation window.”
Common mistake: Letting producers ship breaking changes and expecting downstream teams to ‘just fix it.’
Q: You’re asked to backfill two years of data. How do you do it safely?
Why they ask it: They’re testing operational discipline: backfills can destroy budgets and trust.
Answer framework: “Plan → Isolate → Verify”: isolate compute, throttle, validate with reconciliation.
Example answer: “I’d first define the target state and acceptance checks—row counts, aggregates, and spot checks against source. Then I’d run the backfill in an isolated environment or dedicated queue to protect daily SLAs. I’d process in partitions with checkpoints so we can resume, and I’d publish progress metrics. After completion, I’d reconcile outputs and only then switch consumers or rebuild downstream models.”
Common mistake: Running a massive backfill on the same cluster as production jobs with no throttle or validation.
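The "partitions with checkpoints so we can resume" part is worth being able to sketch. The checkpoint file format and function names here are illustrative choices, assuming partitions are identified by strings such as month keys:

```python
import json
import pathlib

def backfill(partitions, process, checkpoint_path="backfill_state.json"):
    """Process partitions in order, checkpointing after each one so a
    failed run can resume where it stopped.

    `process` is the caller's load function; the JSON checkpoint file is
    an illustrative choice, not a specific tool's format.
    """
    state_file = pathlib.Path(checkpoint_path)
    done = set(json.loads(state_file.read_text())) if state_file.exists() else set()
    for part in partitions:
        if part in done:
            continue  # already backfilled in a previous run
        process(part)
        done.add(part)
        state_file.write_text(json.dumps(sorted(done)))  # checkpoint
    return done
```

Because completed partitions are skipped on rerun, a crash halfway through two years of data costs you one partition of rework, not the whole budget.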
Q: What would you do if Airflow (or your orchestrator) goes down during a critical load window?
Why they ask it: They’re testing incident response and whether you can keep the business running.
Answer framework: “Stabilize → Restore → Prevent”: stop the bleeding, restore service, then fix root cause.
Example answer: “First I’d assess blast radius: which critical datasets will miss SLAs and who needs to know. If possible, I’d run the most critical jobs manually using documented commands or a fallback runner, while the platform team restores the scheduler. Once stable, I’d backfill missed partitions and verify downstream tables. Then I’d address root cause—HA setup, database health, and runbooks—so the next outage is shorter and less chaotic.”
Common mistake: Focusing only on restarting the tool and forgetting stakeholder communication and data validation.
Q: How do you handle PII and access control in a US environment?
Why they ask it: They’re testing governance maturity and awareness of common US compliance expectations.
Answer framework: “Classify → Minimize → Control → Audit”: classify data, reduce exposure, enforce least privilege, log access.
Example answer: “I start by classifying fields—PII, sensitive, non-sensitive—and minimizing what we ingest and retain. Then I enforce least-privilege access with role-based controls, masking or tokenization where appropriate, and separate environments for dev vs. prod. I make sure access is auditable and reviewed regularly. Depending on the company, I align controls with frameworks like SOC 2 and privacy laws like CCPA/CPRA.”
Common mistake: Saying ‘we just put it in a private bucket’ without field-level controls, auditing, or retention policies.
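Field-level classification plus tokenization can be sketched with the stdlib. The field catalog below is an illustrative assumption (a real one lives in a data catalog), and the keyed-hash tokenization shown is one common approach, not the only one:

```python
import hashlib
import hmac

# Illustrative field classification; a real data catalog would drive this.
FIELD_CLASS = {"email": "pii", "ssn": "pii", "plan": "non-sensitive"}

def tokenize(value: str, secret: bytes) -> str:
    """Deterministic keyed tokenization: the same input yields the same
    token, so joins still work, but the raw value never leaves the
    trusted boundary."""
    return hmac.new(secret, value.encode(), hashlib.sha256).hexdigest()[:16]

def minimize(record: dict, secret: bytes) -> dict:
    """Drop unclassified fields, tokenize PII, pass through the rest."""
    out = {}
    for field, value in record.items():
        cls = FIELD_CLASS.get(field)
        if cls is None:
            continue  # data minimization: don't keep what isn't classified
        out[field] = tokenize(value, secret) if cls == "pii" else value
    return out
```

Deterministic tokens preserve joinability for analytics while keeping raw PII out of the warehouse, which is the trade-off interviewers usually want you to name.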
Q: What’s your approach to modeling: star schema, Data Vault, or something else?
Why they ask it: They’re testing whether you can choose a modeling strategy that fits the org.
Answer framework: “Consumers + Change rate”: pick based on who queries, how often sources change, and governance needs.
Example answer: “For BI-heavy use cases, I like dimensional modeling because it’s intuitive and performs well. If sources change frequently and we need strong lineage and auditability, Data Vault can be a good fit—especially for regulated domains. In practice, I often land raw immutable data, build a normalized core for stability, and expose dimensional marts for consumption. The key is consistency and documentation, not ideology.”
Common mistake: Treating modeling as a personal preference instead of a decision driven by consumers and change.
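For the dimensional-marts piece, it helps to have the basic star-schema query shape at your fingertips. This toy example uses SQLite from the stdlib; the table and column names are illustrative:

```python
import sqlite3

def build_mart():
    """Tiny dimensional-model sketch: one fact table keyed to one
    dimension. Table and column names are illustrative."""
    con = sqlite3.connect(":memory:")
    con.executescript("""
        CREATE TABLE dim_customer (customer_key INTEGER PRIMARY KEY,
                                   customer_name TEXT, segment TEXT);
        CREATE TABLE fact_orders (order_id INTEGER, customer_key INTEGER,
                                  amount REAL);
        INSERT INTO dim_customer VALUES (1, 'Acme', 'enterprise'),
                                        (2, 'Bob', 'self-serve');
        INSERT INTO fact_orders VALUES (10, 1, 500.0), (11, 1, 250.0),
                                       (12, 2, 40.0);
    """)
    return con

def revenue_by_segment(con):
    # The canonical star-schema query shape: fact joined to a dimension,
    # grouped by a dimension attribute.
    return con.execute("""
        SELECT d.segment, SUM(f.amount)
        FROM fact_orders f
        JOIN dim_customer d USING (customer_key)
        GROUP BY d.segment
        ORDER BY d.segment
    """).fetchall()
```

The intuitiveness claim in the answer is visible here: business questions map almost word for word onto fact-join-dimension queries.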
Q: As an ETL Developer or Data Pipeline Engineer, how do you prove a pipeline is correct?
Why they ask it: They’re testing whether you can validate end-to-end, not just unit test code.
Answer framework: “Test pyramid for data”: unit tests for transforms, integration tests for contracts, reconciliation for outputs.
Example answer: “I use unit tests for transformation logic, but I don’t stop there. I add contract tests at ingestion—schema, nullability, uniqueness—and I reconcile outputs against source totals and known benchmarks. For critical tables, I run canary loads and compare aggregates before full rollout. Correctness is a system property, so I validate at multiple layers.”
Common mistake: Only testing code paths and ignoring data drift, duplication, and source anomalies.
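The reconciliation step from the answer can be sketched simply. The metric names and flat dict shape are illustrative; the point is that correctness is checked on aggregates against the source, not just on code paths:

```python
def reconcile(source_totals: dict, output_totals: dict, tolerance: float = 0.0):
    """Compare source aggregates vs pipeline-output aggregates and return
    a list of discrepancies.

    `tolerance` is a relative allowance (0.01 = 1%); metric names and the
    flat dict shape are assumptions for illustration.
    """
    issues = []
    for metric, expected in source_totals.items():
        actual = output_totals.get(metric)
        if actual is None:
            issues.append(f"{metric}: missing from output")
        elif abs(actual - expected) > tolerance * abs(expected):
            issues.append(f"{metric}: expected {expected}, got {actual}")
    return issues
```

Running a check like this after a canary load, before full rollout, is how "correctness is a system property" becomes an operational habit rather than a slogan.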