4) Technical and professional questions (what separates prepared from lucky)
This is where US interviews get blunt. They’ll ask you to design something, debug something, and justify something. If you’ve done Data Pipeline Engineer or ETL Developer work, you’ll recognize the themes: idempotency, incremental loads, late data, schema drift, and cost.
Q: Walk me through how you design an incremental load that is idempotent.
Why they ask it: They’re testing whether reruns create duplicates or silent corruption.
Answer framework: “Keys–Watermarks–Merge strategy.” State the unique key, the watermark, and the upsert/merge approach.
Example answer: “I start by defining a stable business key and a change tracking field like updated_at or an ingestion timestamp. I load into a staging table partitioned by ingestion date, then MERGE into the target using the key and latest record logic. Reruns overwrite the same partitions or re-merge deterministically, so duplicates don’t accumulate. I also log row counts and checksum-like aggregates to detect drift.”
Common mistake: Saying “we just truncate and reload” without acknowledging cost, downtime, and downstream impact.
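The keys–watermarks–merge pattern above can be sketched with SQLite's upsert standing in for a warehouse MERGE. This is a minimal illustration, not a production loader; the table and field names (target, order_id, updated_at) are hypothetical:

```python
import sqlite3

# Sketch of an idempotent incremental load: stable key (order_id),
# watermark (updated_at), deterministic upsert. Reruns re-merge the
# same rows, so duplicates cannot accumulate.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE target (order_id INTEGER PRIMARY KEY, amount REAL, updated_at TEXT)"
)

def incremental_load(rows, watermark):
    """Upsert rows newer than the watermark; safe to rerun."""
    new_rows = [r for r in rows if r[2] > watermark]
    conn.executemany(
        """INSERT INTO target (order_id, amount, updated_at)
           VALUES (?, ?, ?)
           ON CONFLICT(order_id) DO UPDATE SET
             amount = excluded.amount,
             updated_at = excluded.updated_at
           WHERE excluded.updated_at >= target.updated_at""",
        new_rows,
    )
    return max((r[2] for r in new_rows), default=watermark)

batch = [(1, 10.0, "2024-01-02"), (2, 5.0, "2024-01-03")]
wm = incremental_load(batch, "2024-01-01")
wm = incremental_load(batch, "2024-01-01")  # rerun: still 2 rows
print(conn.execute("SELECT COUNT(*) FROM target").fetchone()[0])  # 2
```

The point the interviewer is listening for is the last two lines: the second run is a no-op on row counts, which is what "reruns don't create duplicates" means in practice.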
Q: In SQL, how would you find and fix duplicate records caused by a bad join?
Why they ask it: They want practical SQL debugging, not theory.
Answer framework: “Detect → explain → correct.” Show a query to detect duplicates, then the corrected join logic.
Example answer: “First I’d group by the natural key and look for count(*) > 1, then inspect which join introduced multiplicity by checking join cardinality. Often it’s a many-to-many join where a dimension isn’t unique. The fix is to dedupe the dimension with a window function or join on the correct grain, then add a uniqueness test so it can’t regress.”
Common mistake: Blaming ‘SQL being weird’ instead of naming grain and cardinality.
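The detect → explain → correct flow might look like the following sketch, where a dimension that is accidentally non-unique on its key fans out the join, and a window function restores the correct grain. Table and column names are hypothetical:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE fact (order_id INT, customer_id INT);
CREATE TABLE dim  (customer_id INT, segment TEXT, loaded_at TEXT);
INSERT INTO fact VALUES (1, 100), (2, 100);
-- dim is not unique on customer_id: this is the bug
INSERT INTO dim  VALUES (100, 'smb', '2024-01-01'), (100, 'ent', '2024-01-02');
""")

# Detect: the natural key appears more than once after the join.
dupes = conn.execute("""
    SELECT f.order_id, COUNT(*) AS n
    FROM fact f JOIN dim d USING (customer_id)
    GROUP BY f.order_id HAVING COUNT(*) > 1
""").fetchall()
print(sorted(dupes))  # [(1, 2), (2, 2)]

# Correct: dedupe the dimension to one row per key before joining.
fixed = conn.execute("""
    WITH dim_dedup AS (
        SELECT *, ROW_NUMBER() OVER (
            PARTITION BY customer_id ORDER BY loaded_at DESC) AS rn
        FROM dim
    )
    SELECT f.order_id, d.segment
    FROM fact f JOIN dim_dedup d USING (customer_id)
    WHERE d.rn = 1
""").fetchall()
print(len(fixed))  # 2
```

Note the fix names the grain explicitly (one row per customer_id, latest wins), which is exactly the vocabulary the interviewer wants instead of "SQL being weird."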
Q: How do you model data for analytics: star schema vs. wide table vs. Data Vault?
Why they ask it: They’re testing whether you can match modeling style to usage patterns.
Answer framework: “Workload-first.” Start with query patterns, then governance, then change rate.
Example answer: “If the primary consumers are BI tools with predictable dimensions, I lean star schema for performance and clarity. If the use case is exploratory and the team is small, a wide table can be pragmatic—if it’s well-documented and tested. For highly regulated environments or lots of source change, Data Vault can help with auditability, but it adds complexity, so I’d only use it when the benefits outweigh the overhead.”
Common mistake: Declaring one approach ‘best’ without context.
Tooling questions in the US often assume cloud. Even if the company is hybrid, they’ll want to know you can operate modern stacks.
Q: What’s your approach to orchestrating pipelines in Airflow (or a similar orchestrator)?
Why they ask it: They want to see if you build maintainable DAGs with clear dependencies and failure handling.
Answer framework: “DAG hygiene.” Cover scheduling, retries, idempotency, backfills, and alerting.
Example answer: “I keep DAGs thin: orchestration in Airflow, transformation logic in versioned code or dbt. I use task-level retries with exponential backoff, set sensible timeouts, and make tasks idempotent so reruns are safe. For backfills, I prefer partitioned runs and clear catchup behavior. And I wire alerts to the team’s on-call channel with runbook links, not just ‘task failed.’”
Common mistake: Treating Airflow as a place to write all business logic.
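A thin DAG with those defaults might be sketched as follows. This is a configuration sketch against Airflow 2.x, not a tested production DAG; the dag_id, schedule, and callable are placeholders:

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator

default_args = {
    "retries": 3,
    "retry_delay": timedelta(minutes=2),
    "retry_exponential_backoff": True,     # back off between retries
    "execution_timeout": timedelta(hours=1),
}

def load_orders():
    # Idempotent body: overwrite this run's partition, never append blindly.
    ...

with DAG(
    dag_id="orders_daily",                 # hypothetical
    schedule="@daily",                     # Airflow >= 2.4; older versions use schedule_interval
    start_date=datetime(2024, 1, 1),
    catchup=False,                         # make backfill behavior explicit
    default_args=default_args,
) as dag:
    PythonOperator(task_id="load_orders", python_callable=load_orders)
```

The transformation logic itself lives elsewhere (versioned code or dbt); the DAG only declares dependencies, retries, and timeouts, which is the "thin DAG" point of the answer.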
Q: How do you tune Spark jobs for performance and cost?
Why they ask it: Big Data Engineer roles often live or die on efficient distributed processing.
Answer framework: “Bottleneck triage.” Identify whether it’s shuffle, skew, I/O, or serialization.
Example answer: “I start with the Spark UI to see where time is spent—shuffles, skewed stages, or slow reads. Then I fix the root cause: repartitioning, salting skewed keys, using broadcast joins when appropriate, and pruning columns early. I also right-size executors and use autoscaling carefully, because throwing compute at a bad shuffle just burns money. Finally, I validate results and runtime across representative partitions.”
Common mistake: Only saying “increase the cluster size.”
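Salting skewed keys, one of the fixes above, can be illustrated without a cluster: split the hot key across N sub-keys so no single partition owns all of it, then strip the salt when merging partial results. Key names and the salt count here are illustrative:

```python
import random
from collections import Counter

# Sketch of two-stage aggregation over a salted key. Stage 1 counts by
# salted key (spread across partitions); stage 2 strips the salt and
# merges, recovering exact totals.
N_SALTS = 4

def salt(key, hot_keys):
    # Only hot keys get a salt suffix; cold keys pass through unchanged.
    if key in hot_keys:
        return f"{key}#{random.randrange(N_SALTS)}"
    return key

events = ["us"] * 1000 + ["fr"] * 10      # 'us' is heavily skewed
partial = Counter(salt(k, {"us"}) for k in events)
final = Counter()
for salted_key, n in partial.items():
    final[salted_key.split("#")[0]] += n
print(final["us"], final["fr"])  # 1000 10
```

In Spark the same idea applies to joins: salt the skewed side and explode the small side by the salt range so matches still line up (or let adaptive query execution handle it where available).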
Q: Explain how you handle schema evolution and schema drift in production.
Why they ask it: US teams hate silent breakage from upstream changes.
Answer framework: “Contract + detection + safe rollout.” Mention a schema registry or data contracts, alerting on changes, and backward compatibility.
Example answer: “I treat schema as a contract: version it, validate it at ingestion, and alert on changes. For streaming, I’d use a schema registry and compatibility rules; for batch, I’d compare incoming schema to expected and quarantine unexpected fields. When changes are legitimate, I roll them out with backward-compatible defaults and update downstream models with tests. The goal is no silent null explosions.”
Common mistake: Letting pipelines ‘just fail’ without a plan for partial compatibility.
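For the batch case, comparing an incoming record's fields to the expected schema and quarantining anything unexpected might look like this minimal sketch; field names and types are hypothetical:

```python
# Sketch of batch-side drift detection: unknown fields are quarantined
# (routed to a side table plus an alert), missing required fields fail
# loudly instead of silently producing nulls.
EXPECTED = {"order_id": int, "amount": float}

def validate(record):
    unknown = {k: v for k, v in record.items() if k not in EXPECTED}
    missing = [k for k in EXPECTED if k not in record]
    if missing:
        raise ValueError(f"missing required fields: {missing}")
    clean = {k: record[k] for k in EXPECTED}
    return clean, unknown   # 'unknown' goes to quarantine, not the target

clean, quarantined = validate({"order_id": 1, "amount": 9.5, "coupon": "X1"})
print(quarantined)  # {'coupon': 'X1'}
```

The asymmetry is deliberate: an extra field is survivable (quarantine and review), a missing required field is not (fail fast), which is the "no silent null explosions" goal in code form.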
Here’s a question that experienced teams ask because it reveals whether you’ve actually operated a platform.
Q: What data quality checks do you implement beyond “not null”?
Why they ask it: They want to know if you can prevent believable-but-wrong data.
Answer framework: “Dimensions of quality.” Cover freshness, volume, distribution, referential integrity, and business rules.
Example answer: “I use not-null and uniqueness, but I also monitor freshness and volume anomalies per partition. I add distribution checks—like percentiles or category proportions—so we catch shifts that still pass constraints. Referential integrity between facts and dimensions is huge for BI trust. And for key metrics, I add business-rule tests like ‘refunds can’t exceed sales’ with tolerances.”
Common mistake: Only listing generic constraints and ignoring anomaly detection.
US companies will also test security and compliance thinking, especially in software and consumer data environments.
Q: How do you design access control for sensitive data in a warehouse?
Why they ask it: They’re testing least privilege, auditability, and practical governance.
Answer framework: “Classify → control → audit.” Mention PII tagging, role-based access, and logging.
Example answer: “I start by classifying fields—PII, SPI, and internal-only—and tagging them in the catalog. Then I implement role-based access at the schema/table level, and where needed, column-level masking or row-level security. I keep access requests ticketed and auditable, and I monitor query logs for unusual access patterns. The goal is to make the secure path the easy path.”
Common mistake: Saying “security is handled by IT” as if data engineering has no role.
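Column masking by role reduces to a simple rule at read time. In a real warehouse this belongs in masking or row-access policies, not application code; the tags and role names here are hypothetical:

```python
# Sketch of role-based column masking: columns tagged as PII are
# redacted unless the caller holds the privileged role.
PII_COLUMNS = {"email", "ssn"}   # would come from catalog tags in practice

def mask_row(row, role):
    if role == "pii_reader":     # hypothetical privileged role
        return dict(row)
    return {k: ("***" if k in PII_COLUMNS else v) for k, v in row.items()}

row = {"user_id": 1, "email": "a@example.com"}
print(mask_row(row, "analyst"))  # {'user_id': 1, 'email': '***'}
```

The design point is that masking is keyed off catalog classification, not hard-coded per query, which is what makes the secure path the easy path.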
Q: What US regulations or standards have you had to consider when handling customer data?
Why they ask it: They want to see if you understand compliance constraints that shape architecture.
Answer framework: “Context + controls.” Name the regulation and the concrete engineering controls you applied.
Example answer: “I’ve worked in environments influenced by SOC 2 requirements—access control, change management, and audit logs were non-negotiable. I’ve also had to support privacy expectations like CCPA-style deletion requests by designing datasets with clear identifiers and deletion workflows. Even when legal owns interpretation, I make sure the pipelines can prove who accessed what and when.”
Common mistake: Dropping acronyms without explaining what you changed in the system.
Now the failure question—because it always happens.
Q: A critical pipeline fails at 6 a.m. and the CEO dashboard refresh is at 8. What do you do?
Why they ask it: They’re testing operational judgment under time pressure.
Answer framework: “Restore service first, then correctness.” Communicate, triage, choose a safe fallback, then prevent recurrence.
Example answer: “First I’d assess blast radius: which datasets and dashboards are impacted and whether stale data is acceptable. I’d post a quick status update with ETA and a fallback plan, then triage logs to identify whether it’s source outage, credentials, or data anomaly. If the fix is risky, I’d publish the last known good partition with a clear ‘stale as of’ label rather than pushing potentially wrong data. After recovery, I’d write a postmortem and add monitoring so we catch it earlier next time.”
Common mistake: Rushing a hotfix that makes the numbers wrong just to hit a refresh time.
Finally, expect a question that tests whether you can think like a Data Platform Engineer, not just a query writer.
Q: How do you make a data platform observable—what do you monitor and why?
Why they ask it: They want engineers who reduce on-call pain with instrumentation.
Answer framework: “SLIs/SLOs for data.” Define freshness, completeness, latency, cost, and failure rate.
Example answer: “I monitor pipeline success rate, runtime, and queue time, but I also monitor data SLIs: freshness per table, volume per partition, and anomaly scores for key measures. I track cost drivers like bytes scanned and compute hours per job, because cost regressions are production incidents too. And I make alerts actionable: include the failing partition, upstream dependency, and a runbook link.”
Common mistake: Monitoring only infrastructure metrics and ignoring data correctness and freshness.
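An actionable alert of the kind described carries the failing partition, the upstream dependency, and a runbook link, not just a failure flag. A minimal freshness-SLI sketch, with hypothetical table names, dependency, and URL:

```python
from datetime import date

# Sketch of a data SLI check that emits an actionable alert payload
# when a table breaches its freshness SLO, and nothing otherwise.
def freshness_alert(table, latest_partition, as_of, slo_days=1):
    lag = (as_of - latest_partition).days
    if lag <= slo_days:
        return None
    return {
        "table": table,
        "failing_partition": latest_partition.isoformat(),
        "lag_days": lag,
        "upstream": "raw_orders_ingest",                      # hypothetical dependency
        "runbook": "https://wiki.example.com/runbooks/freshness",  # hypothetical URL
    }

alert = freshness_alert("analytics.orders", date(2024, 1, 1), date(2024, 1, 5))
print(alert["lag_days"])  # 4
```

Whoever is paged at 6 a.m. can act on this payload immediately, which is the difference between instrumentation and noise.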