4) Technical and professional questions (the real separators)
This is where US interviews get blunt. They’ll test whether you can reason about correctness, performance, and recoverability — and whether you can do it in the stack they actually run. Expect deep SQL, incremental patterns, and tool-specific questions (especially if the job post mentions SSIS Developer or Informatica Developer work).
Q: Walk me through how you design an incremental load when the source doesn’t have reliable timestamps.
Why they ask it: Incremental strategy is a core ETL skill, and unreliable sources are common.
Answer framework: “Keys–Change detection–Reconciliation”: pick a key, detect changes, validate completeness.
Example answer: If timestamps aren’t reliable, I look for a stable business key and a way to detect change, like a source CDC log, a version column, or a hash of relevant fields. If none exist, I’ll use a snapshot approach with partitioning and compare hashes to find deltas, then upsert into the target. I always add reconciliation checks — counts by partition and a sampled checksum — because the risk is silent data loss.
Common mistake: Saying “I’d just full refresh” without addressing scale, cost, and downstream impact.
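The snapshot-and-hash approach described above can be sketched in plain Python. This is a minimal illustration with hypothetical keys and columns; in practice the hashing and comparison would usually run inside the warehouse at partition scale:

```python
import hashlib

def row_hash(row, columns):
    """Hash the relevant fields so any change in them flips the digest."""
    payload = "|".join(str(row[c]) for c in columns)
    return hashlib.sha256(payload.encode()).hexdigest()

# Hypothetical snapshots keyed by a stable business key
previous = {"C001": {"name": "Acme", "tier": "gold"},
            "C002": {"name": "Beta", "tier": "silver"}}
current  = {"C001": {"name": "Acme", "tier": "platinum"},  # changed
            "C002": {"name": "Beta", "tier": "silver"},    # unchanged
            "C003": {"name": "Gamma", "tier": "bronze"}}   # new

cols = ["name", "tier"]
# Delta = new keys plus keys whose hash changed; these get upserted into the target
deltas = {k: v for k, v in current.items()
          if k not in previous or row_hash(v, cols) != row_hash(previous[k], cols)}
```

The reconciliation step from the answer would then compare counts and sampled checksums between `current` and the post-upsert target before marking the run successful.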
Q: Explain how you handle slowly changing dimensions (Type 1 vs Type 2) and when you’d use each.
Why they ask it: They’re checking data warehousing fundamentals tied to real reporting needs.
Answer framework: Compare–Use case–Implementation details.
Example answer: Type 1 overwrites history — I use it for corrections where history doesn’t matter, like fixing a misspelling. Type 2 preserves history with effective dates and a current flag — I use it for attributes where “as of” reporting matters, like customer segment at time of purchase. Implementation-wise, I ensure a surrogate key, effective_start/effective_end, and logic to close out prior records on change.
Common mistake: Mixing up “audit history” with “transaction history” and applying Type 2 everywhere.
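The Type 2 mechanics above (surrogate key, effective dates, close-out on change) can be sketched with SQLite. Table and column names here are hypothetical; a real warehouse would typically express the same logic as a MERGE:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_customer (
    sk INTEGER PRIMARY KEY AUTOINCREMENT,   -- surrogate key
    customer_id TEXT,                       -- business key
    segment TEXT,                           -- tracked attribute
    effective_start TEXT,
    effective_end TEXT,
    is_current INTEGER
);
INSERT INTO dim_customer (customer_id, segment, effective_start, effective_end, is_current)
VALUES ('C001', 'retail', '2024-01-01', '9999-12-31', 1);
""")

def apply_scd2(conn, customer_id, new_segment, as_of):
    """Close the current row if the tracked attribute changed, then insert the new version."""
    cur = conn.execute(
        "SELECT sk, segment FROM dim_customer WHERE customer_id=? AND is_current=1",
        (customer_id,)).fetchone()
    if cur and cur[1] == new_segment:
        return  # no change: Type 2 only versions on real attribute changes
    if cur:
        conn.execute("UPDATE dim_customer SET effective_end=?, is_current=0 WHERE sk=?",
                     (as_of, cur[0]))
    conn.execute(
        "INSERT INTO dim_customer (customer_id, segment, effective_start, effective_end, is_current) "
        "VALUES (?, ?, ?, '9999-12-31', 1)", (customer_id, new_segment, as_of))

apply_scd2(conn, "C001", "wholesale", "2024-06-01")
rows = conn.execute(
    "SELECT segment, effective_end, is_current FROM dim_customer "
    "WHERE customer_id='C001' ORDER BY sk").fetchall()
```

A Type 1 change, by contrast, would be a single UPDATE in place with no close-out row.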
Q: How do you build idempotent ETL jobs?
Why they ask it: Retries happen. They want to know you won’t duplicate data.
Answer framework: “Deterministic inputs + safe writes”: define boundaries, then use merge/overwrite safely.
Example answer: I make each run deterministic by defining a clear processing window or partition, like a date or batch ID. Then I write in a way that can be repeated: overwrite a partition, or use a MERGE keyed on natural keys plus event time, with dedupe rules. I also store run metadata so we can reprocess a specific batch without guessing.
Common mistake: Relying on “we rarely rerun” instead of engineering for reruns.
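The partition-overwrite pattern from the answer can be sketched as delete-then-insert inside one transaction, shown here with SQLite and a hypothetical fact table. The point is that rerunning the same batch produces the same final state:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE fact_sales (sale_date TEXT, order_id TEXT, amount REAL)")

def load_partition(conn, sale_date, rows):
    """Idempotent load: wipe the target partition, then insert the batch.
    Both statements sit in one transaction, so a retry never half-applies."""
    with conn:  # sqlite3 context manager: commit on success, roll back on error
        conn.execute("DELETE FROM fact_sales WHERE sale_date = ?", (sale_date,))
        conn.executemany("INSERT INTO fact_sales VALUES (?, ?, ?)",
                         [(sale_date, oid, amt) for oid, amt in rows])

batch = [("O-1", 100.0), ("O-2", 250.0)]
load_partition(conn, "2024-06-01", batch)
load_partition(conn, "2024-06-01", batch)  # simulated retry: no duplicates
count = conn.execute("SELECT COUNT(*) FROM fact_sales").fetchone()[0]
```

Storing the batch ID in run metadata, as the answer suggests, is what lets you point a reprocess at exactly one partition like this without guessing.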
Q: What data quality checks do you consider non-negotiable in production pipelines?
Why they ask it: They want to see if you think like an owner, not a script writer.
Answer framework: “Contract checks + anomaly checks + reconciliation.”
Example answer: I start with contract checks: schema changes, required fields not null, and key uniqueness. Then anomaly checks: volume shifts, distribution changes, and outliers on critical metrics. Finally reconciliation: source-to-target counts by partition and a few business totals that should tie out. The goal is catching issues early, before a VP sees a broken dashboard.
Common mistake: Only checking row counts and calling it “data quality.”
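The contract-check layer is the easiest to make concrete. A minimal sketch, with hypothetical column names, that validates required non-null fields and key uniqueness and returns readable failures instead of silently passing:

```python
def contract_checks(rows, required, key):
    """Minimal contract checks: required fields present and non-null, key unique.
    Returns a list of human-readable failures (empty list = pass)."""
    failures = []
    for i, row in enumerate(rows):
        for col in required:
            if row.get(col) is None:
                failures.append(f"row {i}: {col} is null or missing")
    keys = [row.get(key) for row in rows]
    if len(keys) != len(set(keys)):
        failures.append(f"duplicate values in key column {key}")
    return failures

good = [{"id": 1, "amount": 10}, {"id": 2, "amount": 20}]
bad  = [{"id": 1, "amount": None}, {"id": 1, "amount": 20}]  # null metric + dup key
ok_result  = contract_checks(good, required=["id", "amount"], key="id")
bad_result = contract_checks(bad,  required=["id", "amount"], key="id")
```

Anomaly and reconciliation checks would layer on top of this, comparing the batch against historical volumes and against source-side totals.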
Q: In SQL, how would you deduplicate events when you can get late-arriving duplicates?
Why they ask it: This is a day-to-day ETL reality, and it tests window functions and business logic.
Answer framework: “Define uniqueness + choose winner + implement with window functions.”
Example answer: I define a uniqueness key, often event_id alone or a composite like (customer_id, event_type, event_timestamp, source_system). Then I choose the winning record with a deterministic rule — latest ingestion time, highest version, or a priority flag. In SQL, I'd use ROW_NUMBER() partitioned by the uniqueness key and ordered by the winner rule, then keep only the rows where row_number = 1.
Common mistake: Deduping on too few columns and accidentally collapsing legitimate events.
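The ROW_NUMBER() pattern can be run end to end with SQLite (which supports window functions from version 3.25). Table and column names are illustrative; here the winner rule is latest ingestion time:

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # needs SQLite >= 3.25 for window functions
conn.executescript("""
CREATE TABLE events (event_id TEXT, payload TEXT, ingested_at TEXT);
INSERT INTO events VALUES
 ('E1', 'v1', '2024-06-01T00:00:00'),
 ('E1', 'v2', '2024-06-01T02:00:00'),  -- late-arriving duplicate: latest ingest wins
 ('E2', 'v1', '2024-06-01T01:00:00');
""")

rows = conn.execute("""
WITH ranked AS (
  SELECT event_id, payload,
         ROW_NUMBER() OVER (
           PARTITION BY event_id        -- the uniqueness key
           ORDER BY ingested_at DESC    -- the winner rule
         ) AS rn
  FROM events
)
SELECT event_id, payload FROM ranked WHERE rn = 1
ORDER BY event_id
""").fetchall()
```

Widening the PARTITION BY to a composite key is a one-line change, which is exactly why getting the uniqueness definition right matters more than the SQL itself.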
Q: What’s the difference between ETL and ELT, and how does it change your design?
Why they ask it: They want to see if you can operate as an ELT Developer in modern cloud stacks.
Answer framework: “Where transforms run + implications”: compute, governance, lineage, cost.
Example answer: In ETL, you transform before loading into the warehouse; in ELT, you load raw data first and transform inside the warehouse/lakehouse. ELT changes design because you typically keep raw/staging layers, rely on warehouse compute for transformations, and emphasize lineage and access controls so raw data doesn’t become a free-for-all. It also shifts cost management — heavy transforms can get expensive if you don’t optimize.
Common mistake: Treating ELT as “ETL but trendy” without discussing governance and cost.
Q: If you were hired as an SSIS Developer, what are the first performance levers you’d check in a slow package?
Why they ask it: Tool-specific depth; they want someone who’s actually tuned SSIS.
Answer framework: “Bottleneck hunt”: source, transform, destination, then SSIS settings.
Example answer: I’d start by isolating whether the bottleneck is the source query, the transformations, or the destination writes. In SSIS specifically, I check data flow buffer sizing (DefaultBufferMaxRows and DefaultBufferSize) and whether we’re running blocking transforms like Sort or Aggregate unnecessarily. For destinations, I look at fast load settings, batch size, and indexes/constraints on the target. Then I validate with SSIS logging and SQL execution plans.
Common mistake: Tweaking random SSIS properties before proving where the time is going.
Q: If you were hired as an Informatica Developer, how do you design for reusability and maintainability?
Why they ask it: They’re testing whether you build scalable mappings/workflows, not one-offs.
Answer framework: “Parameterize + modularize + standardize.”
Example answer: I use parameters and variables for environment-specific values, keep mappings modular, and standardize naming conventions so lineage is readable. I also separate reusable transformations (like standard cleansing) from subject-area logic, and I build restartability into workflows with checkpoints. That way, when a new source comes in, we reuse patterns instead of cloning spaghetti.
Common mistake: Copy-pasting mappings for each new feed and creating maintenance debt.
Q: How do you handle PII in ETL pipelines in the US (masking, access, and compliance)?
Why they ask it: US companies are sensitive to privacy and security; they want practical controls.
Answer framework: “Classify–Minimize–Protect–Audit.”
Example answer: First I classify which fields are PII and whether they’re needed downstream. I minimize by not loading unnecessary PII into analytics zones, and I protect what remains with encryption in transit/at rest, role-based access, and masking/tokenization where appropriate. I also ensure auditability — who accessed what — and align with internal policies and applicable laws like HIPAA for healthcare or GLBA for financial services.
Common mistake: Saying “we’re compliant” without naming concrete controls.
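The masking and tokenization controls mentioned above can be sketched briefly. This is an illustration only, with a hypothetical secret and record; production systems would pull the key from a secrets manager and apply these transforms at the boundary of the analytics zone:

```python
import hashlib
import hmac

SECRET = b"rotate-me"  # hypothetical key; keep the real one in a secrets manager

def tokenize(value: str) -> str:
    """Deterministic keyed token: same input yields the same token, so joins
    still work downstream, but the raw value never lands in analytics tables."""
    return hmac.new(SECRET, value.encode(), hashlib.sha256).hexdigest()[:16]

def mask_email(email: str) -> str:
    """Partial masking for display: keep the domain, hide the local part."""
    local, _, domain = email.partition("@")
    return local[0] + "***@" + domain

record = {"email": "jane.doe@example.com", "ssn": "123-45-6789", "state": "TX"}
safe = {"email_token": tokenize(record["email"]),
        "email_masked": mask_email(record["email"]),
        "ssn_token": tokenize(record["ssn"]),   # raw SSN never copied downstream
        "state": record["state"]}               # non-PII passes through untouched
```

Role-based access, encryption in transit/at rest, and access auditing sit around code like this; the transform itself only covers the "protect" step of classify–minimize–protect–audit.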
Q: What US regulation or standard have you worked under that affected your data pipelines?
Why they ask it: They want evidence you can build within constraints (audit trails, retention, access).
Answer framework: Context–Requirement–Implementation–Proof.
Example answer: In a healthcare project, HIPAA requirements influenced how we handled PHI: we restricted access to raw zones, logged access, and avoided copying PHI into broad BI datasets. We implemented column-level masking for identifiers and ensured secure transfer and storage. We also documented data flows for audits and kept retention policies aligned with the organization’s compliance team.
Common mistake: Name-dropping a regulation without explaining what you changed in the pipeline.
Q: A critical load fails at 5 a.m. and the CFO dashboard refresh is at 7 a.m. What do you do?
Why they ask it: This is the job. They’re testing triage, communication, and recovery.
Answer framework: Triage–Stabilize–Communicate–Recover–Prevent.
Example answer: I’d first identify whether it’s a source outage, transformation error, or destination issue by checking orchestration logs and the last successful checkpoint. If there’s a safe partial recovery path, I’ll prioritize the minimum dataset needed for the CFO dashboard and run a targeted backfill. In parallel, I notify stakeholders with a clear ETA and what’s impacted. After the deadline, I do the full fix and add a guardrail — better alerting, retries, or a fallback dataset.
Common mistake: Going silent while debugging, then surprising everyone at 6:55 a.m.