4) Technical and professional questions (the ones that decide the offer)
This is where US interviewers separate “can write a pipeline” from “can run production.” Expect deep dives into incremental loads, CDC, schema drift, orchestration, and how you test data. If the job description mentions SSIS Developer or Informatica Developer work, they’ll go specific—because those tools have very particular failure modes.
Q: Walk me through how you design an incremental load. What keys do you rely on, and how do you handle late-arriving data?
Why they ask it: Full reloads don’t scale; they want someone who can design for growth.
Answer framework: “3-layer” explanation: watermark strategy, change capture, reconciliation/backfill.
Example answer: “I start by identifying a reliable change signal: updated_at, CDC logs, or a monotonically increasing ID. I design the load to be idempotent—so reruns don’t duplicate—and I store the watermark in a control table. For late-arriving data, I use a lookback window and a reconciliation query that compares counts and key coverage between source and target. If the source can’t guarantee timestamps, I’ll push for CDC or build a hash-based change detection on a stable business key.”
Common mistake: Saying “use updated_at” as if it’s always trustworthy.
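The watermark-plus-lookback strategy above can be sketched in a few lines. This is a minimal, illustrative Python sketch: the control table and target are in-memory dicts, and the names (`orders_watermark`, `order_id`) are hypothetical, not from any specific stack.

```python
from datetime import datetime, timedelta

# Hypothetical in-memory stand-ins for a control table and a target table.
control = {"orders_watermark": datetime(2024, 1, 10, 6, 0)}
target = {}  # business_key -> row, so reruns overwrite instead of duplicating

LOOKBACK = timedelta(hours=2)  # re-read a window to catch late-arriving rows

def incremental_load(source_rows):
    """Idempotent incremental load driven by a stored watermark."""
    since = control["orders_watermark"] - LOOKBACK
    picked = [r for r in source_rows if r["updated_at"] >= since]
    for r in picked:
        target[r["order_id"]] = r  # keyed upsert: rerun-safe by design
    if picked:
        control["orders_watermark"] = max(r["updated_at"] for r in picked)
    return len(picked)

rows = [
    {"order_id": 1, "updated_at": datetime(2024, 1, 10, 5, 30)},  # inside lookback
    {"order_id": 2, "updated_at": datetime(2024, 1, 10, 7, 0)},   # new
    {"order_id": 3, "updated_at": datetime(2024, 1, 9, 0, 0)},    # already loaded
]
incremental_load(rows)  # picks rows 1 and 2, advances the watermark
incremental_load(rows)  # rerun: same rows re-upserted, no duplicates
```

The upsert keyed on the business key is what makes the rerun safe; without it, the lookback window would create duplicates on every run.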
Q: ETL vs ELT—when would you choose each in a modern warehouse setup?
Why they ask it: They’re testing architecture judgment, not definitions.
Answer framework: Tradeoff triangle: cost, governance, and performance.
Example answer: “If the warehouse is strong and transformations are mostly relational, ELT patterns make sense: land raw data, transform in-warehouse, and keep lineage and tests close to the models. I’ll still do ETL when I need heavy pre-processing, sensitive data masking before landing, or when source constraints require it. The decision is usually about where you want compute, where you can enforce governance, and how quickly you need to iterate.”
Common mistake: Treating ELT as ‘new’ and ETL as ‘old’ instead of choosing based on constraints.
Q: How do you test ETL pipelines? Give examples of data quality checks you actually implement.
Why they ask it: They want to know if you prevent incidents or just react to them.
Answer framework: Testing pyramid for data: unit (transform logic), integration (pipeline), and monitoring (production).
Example answer: “I test at three levels. For transformations, I validate business rules with small fixtures—like ensuring refunds don’t count as revenue. For pipeline integration, I check row counts, key uniqueness, and referential integrity between fact and dimension tables. In production, I monitor freshness, volume anomalies, and distribution shifts, and I alert on thresholds that match the business cadence. The goal is catching silent failures before a VP sees a broken dashboard.”
Common mistake: Only mentioning ‘unit tests’ without any production monitoring.
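The integration-level checks named in the answer (row counts, key uniqueness, referential integrity) can be sketched as small assertion helpers. This is an illustrative Python sketch over in-memory rows; the table and column names are hypothetical.

```python
def check_row_count(rows, min_expected):
    """Volume check: fail loudly if the load is suspiciously small."""
    assert len(rows) >= min_expected, f"expected >= {min_expected}, got {len(rows)}"

def check_unique_key(rows, key):
    """Uniqueness check on the business key."""
    keys = [r[key] for r in rows]
    dupes = {k for k in keys if keys.count(k) > 1}
    assert not dupes, f"duplicate keys: {dupes}"

def check_referential_integrity(fact_rows, dim_keys, fk):
    """Every fact row must point at an existing dimension row."""
    orphans = [r[fk] for r in fact_rows if r[fk] not in dim_keys]
    assert not orphans, f"orphaned foreign keys: {orphans}"

facts = [{"order_id": 1, "customer_id": "c1"},
         {"order_id": 2, "customer_id": "c2"}]
dims = {"c1", "c2"}
check_row_count(facts, 1)
check_unique_key(facts, "order_id")
check_referential_integrity(facts, dims, "customer_id")
```

In practice these run as post-load steps in the orchestrator, with failures blocking downstream models rather than just logging.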
Q: Explain how you handle schema drift from upstream sources.
Why they ask it: Schema drift is constant in US SaaS-heavy environments.
Answer framework: Detect → Decide → Deploy: detection mechanism, policy, and rollout.
Example answer: “I detect drift with automated schema comparisons on ingestion and fail fast for breaking changes. Then I decide based on a policy: additive columns can be auto-accepted into a raw layer, but type changes or dropped columns require review. I keep a versioned schema registry or at least a tracked DDL history, and I communicate changes in release notes. For critical tables, I prefer contracts with upstream teams so drift becomes an exception, not a surprise.”
Common mistake: Auto-accepting everything and letting downstream models silently change.
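The detect-and-decide steps can be sketched as a schema comparison with the policy from the answer: additive columns pass, dropped or retyped columns are flagged as breaking. This is an illustrative Python sketch; the schemas are simple name-to-type dicts, not any real registry format.

```python
def classify_drift(expected, observed):
    """Compare expected vs observed schemas and classify each change.

    Policy (illustrative): additive columns are auto-accepted into the raw
    layer; dropped columns and type changes are breaking and need review.
    Both arguments map column name -> type string.
    """
    additive = sorted(set(observed) - set(expected))
    dropped = sorted(set(expected) - set(observed))
    retyped = sorted(c for c in expected.keys() & observed.keys()
                     if expected[c] != observed[c])
    return {"additive": additive, "dropped": dropped,
            "retyped": retyped, "breaking": bool(dropped or retyped)}

expected = {"id": "int", "email": "text", "created_at": "timestamp"}
observed = {"id": "int", "email": "varchar", "signup_channel": "text"}
report = classify_drift(expected, observed)
# signup_channel is additive; created_at was dropped and email retyped,
# so the report is marked breaking and the load should fail fast.
```

A real implementation would pull `expected` from a schema registry or tracked DDL history and run this comparison at ingestion time.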
Q: In SQL, how would you deduplicate records when you have multiple updates per business key?
Why they ask it: This is a daily task for an ETL Developer, and they want clean, deterministic logic.
Answer framework: State the rule, then the pattern (window function / max timestamp / tie-breaker).
Example answer: “I define the ‘winner’ record—usually the latest updated_at, and if there’s a tie, the highest sequence number or ingestion timestamp. Then I use a window function like ROW_NUMBER() OVER (PARTITION BY business_key ORDER BY updated_at DESC, ingest_ts DESC) and filter to row number 1. I’ll also keep the full history in a raw or audit table if the business needs it.”
Common mistake: Using DISTINCT and hoping it solves duplicates.
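The same winner-picking rule can be shown outside SQL. This Python sketch mirrors the ROW_NUMBER() pattern from the answer: partition by business key, order by updated_at descending with ingest_ts as the tie-breaker, keep row 1. The field names are hypothetical.

```python
def dedupe(rows, key="business_key"):
    """Keep one 'winner' per business key: latest updated_at, then latest
    ingest_ts as the tie-breaker (mirrors ROW_NUMBER() ... filtered to 1)."""
    winners = {}
    for r in rows:
        cur = winners.get(r[key])
        # Tuple comparison implements ORDER BY updated_at DESC, ingest_ts DESC.
        if cur is None or (r["updated_at"], r["ingest_ts"]) > (cur["updated_at"], cur["ingest_ts"]):
            winners[r[key]] = r
    return list(winners.values())

rows = [
    {"business_key": "A", "updated_at": 1, "ingest_ts": 10, "status": "new"},
    {"business_key": "A", "updated_at": 2, "ingest_ts": 11, "status": "paid"},
    {"business_key": "A", "updated_at": 2, "ingest_ts": 12, "status": "refunded"},  # tie on updated_at
    {"business_key": "B", "updated_at": 1, "ingest_ts": 13, "status": "new"},
]
deduped = dedupe(rows)
# Key A resolves to "refunded": latest updated_at, then latest ingest_ts.
```

Note the explicit tie-breaker: without it, two rows sharing an updated_at would resolve nondeterministically, which is exactly the problem DISTINCT hides.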
Q: What’s your approach to CDC (Change Data Capture), and what can go wrong?
Why they ask it: CDC is common in enterprise US stacks and easy to mess up.
Answer framework: Mechanism → Semantics → Failure modes.
Example answer: “I’ve used CDC via database logs and via tool-managed connectors. The key is understanding semantics: inserts, updates, deletes, and how to represent them downstream—especially deletes. Common failure modes are missing events during connector downtime, out-of-order events, and schema changes that break parsing. I mitigate with checkpointing, replay capability, and periodic reconciliation against source-of-truth counts.”
Common mistake: Talking about CDC like it’s just ‘incremental loads’ without delete handling.
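The semantics and failure modes from the answer can be sketched as an event applier: deletes actually remove rows, a checkpoint makes replay safe, and sorting tolerates out-of-order delivery within a batch. This is an illustrative Python sketch with hypothetical event fields, not any specific connector's format.

```python
def apply_cdc(state, events, checkpoint):
    """Apply CDC events to a keyed state, honoring deletes and skipping
    anything at or before the last processed sequence number."""
    for ev in sorted(events, key=lambda e: e["seq"]):  # tolerate out-of-order delivery
        if ev["seq"] <= checkpoint:
            continue  # already applied; makes replay after downtime safe
        if ev["op"] == "delete":
            state.pop(ev["key"], None)  # deletes must propagate, not be dropped
        else:  # insert or update
            state[ev["key"]] = ev["row"]
        checkpoint = ev["seq"]
    return checkpoint

state = {}
events = [
    {"seq": 2, "op": "update", "key": "u1", "row": {"email": "new@x.com"}},
    {"seq": 1, "op": "insert", "key": "u1", "row": {"email": "old@x.com"}},
    {"seq": 3, "op": "delete", "key": "u1", "row": None},
]
cp = apply_cdc(state, events, checkpoint=0)
# u1 is inserted, updated, then deleted; replaying the same batch is a no-op.
cp = apply_cdc(state, events, checkpoint=cp)
```

The periodic reconciliation mentioned in the answer would run alongside this, comparing key counts in `state` against the source of truth to catch events lost during connector downtime.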
Q: If you’ve worked as an SSIS Developer: how do you handle package configuration, deployments, and environment differences?
Why they ask it: They want operational maturity, not just dragging components in a designer.
Answer framework: Configuration strategy → Deployment model → Observability.
Example answer: “I separate configuration from code using SSIS parameters and environment variables, and I keep connection strings and secrets out of the package. For deployments, I prefer the SSIS Catalog with project deployment, versioned builds, and consistent environments for dev/test/prod. I also standardize logging—row counts, error outputs, and execution reports—so troubleshooting doesn’t require guessing.”
Common mistake: Hardcoding connection strings or manually editing packages in production.
Q: If you’ve worked as an Informatica Developer: how do you tune a slow mapping?
Why they ask it: Informatica performance tuning is a specific skill employers still pay for.
Answer framework: Source → Transform → Target: isolate where time is spent, then optimize in order.
Example answer: “I start by checking whether the bottleneck is source extraction, transformation, or target load. Then I push filters and joins down to the database when possible, reduce data early, and review cache sizes for lookups. I’ll also tune session properties—commit intervals, partitioning, and bulk load options—based on the target. Finally, I validate with runtime metrics so we’re not tuning by superstition.”
Common mistake: Randomly changing session settings without measuring where time is actually going.
Q: How do you manage PII in pipelines in the United States, and what standards do you follow?
Why they ask it: They’re testing whether you understand compliance realities (privacy + security) in US environments.
Answer framework: Identify → Minimize → Protect → Audit.
Example answer: “First I classify fields—PII, PHI, PCI—and confirm who is allowed to see what. I minimize exposure by masking or tokenizing before data hits broad-access layers, and I enforce least-privilege roles in the warehouse. I also encrypt in transit and at rest and keep audit logs for access and changes. Depending on the domain, I align controls to SOC 2 expectations and, if it’s healthcare, HIPAA requirements.”
Common mistake: Saying “we just don’t store PII” when the pipeline clearly touches customer data.
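The mask-or-tokenize step from the answer can be sketched with the standard library. This is an illustrative Python sketch: the secret and field names are hypothetical, and a real pipeline would pull the key from a secrets manager and rotate it.

```python
import hashlib
import hmac

SECRET = b"rotate-me"  # hypothetical; in practice fetched from a secrets manager

def tokenize(value: str) -> str:
    """Deterministic keyed token: joins on the token still work, but raw PII
    never lands in broad-access layers. HMAC rather than a plain hash, so the
    tokens resist dictionary/lookup attacks without the key."""
    return hmac.new(SECRET, value.lower().encode(), hashlib.sha256).hexdigest()[:16]

def mask_email(email: str) -> str:
    """Partial mask for human-facing layers."""
    local, _, domain = email.partition("@")
    return f"{local[0]}***@{domain}"

row = {"customer_id": "c42", "email": "Jane.Doe@example.com"}
safe = {
    "customer_id": row["customer_id"],
    "email_token": tokenize(row["email"]),    # joinable surrogate
    "email_masked": mask_email(row["email"]),  # "J***@example.com"
}
```

Normalizing before hashing (here, lowercasing) matters: otherwise the same customer tokenizes differently across sources and joins silently break.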
Q: A critical load fails at 6 a.m. and the CFO’s dashboard refresh is at 7 a.m. What do you do?
Why they ask it: This is the real job: triage under time pressure.
Answer framework: Triage playbook: contain, diagnose, restore, communicate, prevent.
Example answer: “I’d first check whether the failure is upstream availability, credentials, data volume anomaly, or transformation error, and I’d look at the last successful checkpoint. If we can rerun safely, I’ll do a targeted rerun from the last good stage rather than a full reload. In parallel, I’ll message stakeholders with an ETA and whether numbers will be partial. After recovery, I’ll write a short postmortem and add a guardrail—like upstream SLA checks or better alerting—so it’s less likely next time.”
Common mistake: Going silent while you debug, then surprising everyone at 7:05.
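The “rerun from the last good stage” move in the answer is worth making concrete. This is a minimal Python sketch with hypothetical stage names; real orchestrators (Airflow, SSIS Catalog, etc.) record the checkpoint for you, but the triage logic is the same.

```python
STAGES = ["extract", "stage", "transform", "publish"]

def rerun_from_checkpoint(last_good, run_stage):
    """Rerun only the stages after the last successful one,
    instead of paying for a full reload under time pressure."""
    start = STAGES.index(last_good) + 1 if last_good in STAGES else 0
    completed = []
    for stage in STAGES[start:]:
        run_stage(stage)  # assumes each stage is idempotent, per the design above
        completed.append(stage)
    return completed

# e.g. extract and stage succeeded before the 6 a.m. failure,
# so only the tail of the pipeline reruns before the 7 a.m. refresh.
ran = rerun_from_checkpoint("stage", run_stage=lambda s: None)
```

The targeted rerun only works because earlier design choices (idempotent loads, stored watermarks) made partial reruns safe, which is the thread interviewers are listening for.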