4) Technical and professional questions (what separates prepared candidates)
Technical rounds in the US are rarely pure theory. Even when they ask about bias-variance or p-values, they’re really asking: can you make good decisions under constraints, and can you explain them to people who don’t care about your loss function?
Q: Walk me through how you would design an A/B test for a new ranking algorithm.
Why they ask it: They’re testing experiment design, metrics, power, and failure modes like interference and logging changes.
Answer framework: PREP: Primary metric → Risks/guardrails → Experiment design → Power & duration.
Example answer: “I’d start by defining the primary metric—e.g., a long-term engagement proxy like 7-day retention or session depth—then add guardrails like latency and error rate. I’d choose randomization at the user level to reduce contamination, and I’d confirm consistent exposure logging. For power, I’d use baseline variance and MDE to estimate duration, and I’d pre-register the analysis plan including how we handle multiple metrics. Finally, I’d plan a ramp with a holdout to detect novelty effects.”
Common mistake: Ignoring instrumentation and guardrails.
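The power-and-duration step above can be sketched with a standard two-proportion sample-size approximation. This is a simplified illustration with made-up numbers (40% baseline retention, +1 point MDE, 5,000 users per arm per day are all assumptions), not a substitute for a proper power analysis tool.

```python
# Rough sample-size sketch for a two-proportion z-test.
# All numbers below are hypothetical, for illustration only.
def sample_size_per_arm(p_baseline, mde, ):
    """Approximate users needed per arm to detect an absolute lift
    of `mde` over `p_baseline` at alpha=0.05 (two-sided), power=0.80."""
    z_alpha = 1.96   # two-sided alpha = 0.05
    z_beta = 0.84    # power = 0.80
    p_treat = p_baseline + mde
    var = p_baseline * (1 - p_baseline) + p_treat * (1 - p_treat)
    n = ((z_alpha + z_beta) ** 2) * var / (mde ** 2)
    return int(n) + 1

n = sample_size_per_arm(0.40, 0.01)          # ~38k users per arm
daily_users_per_arm = 5_000                  # assumed eligible traffic
days = n / daily_users_per_arm               # duration estimate in days
```

Note how halving the MDE roughly quadruples the required sample, which is why the MDE conversation with stakeholders should happen before the test starts.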
Q: In SQL, how would you compute 7-day retention by signup cohort?
Why they ask it: They want to see if you can translate product questions into correct joins, windows, and definitions.
Answer framework: Define entities first (user, signup date, activity date), then outline query logic.
Example answer: “I’d build a cohort table with each user’s signup_date, then join to an events table filtered to activity between signup_date+1 and signup_date+7. I’d count distinct active users per cohort and divide by cohort size. I’d be careful about time zones and whether ‘day’ is calendar day or 24-hour windows. If the events table is huge, I’d pre-aggregate daily active flags.”
Common mistake: Forgetting to define retention window precisely.
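The query logic above can be sketched end-to-end with a toy schema. The table and column names here (`users`, `events`, `signup_date`, `event_date`) are illustrative assumptions; the key points are the day-window definition (activity between day 1 and day 7 after signup, calendar days) and dividing distinct retained users by cohort size.

```python
import sqlite3

# Toy schema, assumed for illustration. Dates are ISO strings, so SQLite's
# julianday() gives day arithmetic; "day" here means calendar day.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users (user_id INT, signup_date TEXT);
CREATE TABLE events (user_id INT, event_date TEXT);
INSERT INTO users VALUES (1, '2024-01-01'), (2, '2024-01-01'), (3, '2024-01-02');
INSERT INTO events VALUES (1, '2024-01-03'), (2, '2024-01-15'), (3, '2024-01-05');
""")

query = """
SELECT u.signup_date AS cohort,
       COUNT(DISTINCT u.user_id) AS cohort_size,
       COUNT(DISTINCT CASE
         WHEN julianday(e.event_date) - julianday(u.signup_date) BETWEEN 1 AND 7
         THEN e.user_id END) * 1.0
         / COUNT(DISTINCT u.user_id) AS retention_7d
FROM users u
LEFT JOIN events e ON e.user_id = u.user_id
GROUP BY u.signup_date
ORDER BY cohort;
"""
rows = conn.execute(query).fetchall()
# cohort 2024-01-01: user 1 active on day 2, user 2 only on day 14 -> 50%
# cohort 2024-01-02: user 3 active on day 3 -> 100%
```

The `LEFT JOIN` plus `CASE` keeps never-active users in the cohort denominator, which is exactly the detail the precise-window definition protects.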
Q: When would you choose logistic regression over gradient-boosted trees for a classification problem?
Why they ask it: They’re testing your model selection logic, interpretability needs, and operational constraints.
Answer framework: Constraints-first: interpretability, latency, data size, nonlinearity, and monitoring.
Example answer: “If stakeholders need a stable, explainable model and the relationship is roughly linear in log-odds, I’ll start with logistic regression. It’s fast, easier to calibrate, and simpler to monitor. If I see strong nonlinearities or interactions and I can afford more complexity, I’ll move to boosted trees and then invest in calibration and explanation tooling. I also consider retraining cadence—simple models are cheaper to keep healthy.”
Common mistake: Treating ‘more complex’ as automatically ‘better.’
Q: How do you handle imbalanced classes in a fraud or rare-event model?
Why they ask it: They want to see if you understand metrics, sampling, and thresholding in high-cost error settings.
Answer framework: 4-part approach: metric choice → data strategy → model strategy → threshold/business policy.
Example answer: “I’d avoid accuracy and focus on PR-AUC, recall at fixed precision, or cost-based metrics. On data, I might use stratified sampling for training but evaluate on the true distribution. On modeling, I’d try class weights or focal loss depending on the algorithm. Then I’d set thresholds based on operational capacity—like how many cases the review team can handle—and monitor drift because base rates change.”
Common mistake: Reporting ROC-AUC only and calling it done.
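The “recall at fixed precision, thresholds set by capacity” idea can be sketched with a simple threshold scan. This assumes you already have held-out scores and labels; it is a minimal illustration, not a production metric library.

```python
# Sketch: pick the lowest score threshold whose precision still meets a
# floor, and read off the recall achieved there.
def recall_at_precision(scores, labels, min_precision):
    """Scan candidate thresholds (each distinct score, descending) and
    return (threshold, recall) for the lowest threshold whose precision
    is >= min_precision, or None if no threshold qualifies."""
    pairs = sorted(zip(scores, labels), reverse=True)
    total_pos = sum(labels)
    best = None
    tp = fp = 0
    for score, label in pairs:
        tp += label
        fp += 1 - label
        precision = tp / (tp + fp)
        if precision >= min_precision:
            best = (score, tp / total_pos)   # lowest qualifying cutoff so far
    return best
```

In practice the threshold then gets sanity-checked against review-team capacity: if the qualifying cutoff implies more daily cases than the team can work, the business policy, not the model, decides the final threshold.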
Q: Explain data leakage with a real example you’ve seen (or could plausibly happen).
Why they ask it: Leakage is a classic “looks great offline, fails in production” trap.
Answer framework: Definition → Example → Prevention checklist.
Example answer: “Leakage is when training data includes information that wouldn’t be available at prediction time. I’ve seen it when features were built from post-event timestamps—like using ‘days since last support ticket’ when the ticket was created after the churn event. To prevent it, I build features with point-in-time correctness, use time-based splits, and run leakage tests that compare feature availability windows.”
Common mistake: Only describing leakage as ‘using the target column.’
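The point-in-time correctness idea from the example answer can be shown as a one-line guard. The feature name and dates below are hypothetical; the only load-bearing part is the `t < as_of` filter that excludes anything created after the prediction cutoff.

```python
from datetime import date

# Point-in-time-correct feature builder: only events strictly before the
# prediction cutoff may contribute. Names and dates are illustrative.
def days_since_last_ticket(ticket_dates, as_of):
    """ticket_dates: ticket-creation dates; as_of: prediction date.
    Returns days since the most recent ticket BEFORE as_of, or None."""
    prior = [t for t in ticket_dates if t < as_of]   # the leakage guard
    if not prior:
        return None
    return (as_of - max(prior)).days

tickets = [date(2024, 1, 5), date(2024, 3, 1)]   # second ticket is post-event
cutoff = date(2024, 2, 1)                        # churn event / prediction time
```

Without the filter, the March ticket (created after the churn event) would silently inflate offline performance, which is exactly the failure mode described above.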
Q: What’s your approach to model monitoring in production?
Why they ask it: They’re testing whether you can keep models alive: drift, performance decay, and alert fatigue.
Answer framework: Three layers: data health, model health, business health.
Example answer: “I monitor data freshness, schema changes, missingness, and feature distribution drift. Then I track model outputs—score distributions, calibration, and, when labels arrive, performance by segment. Finally, I tie it to business metrics like conversion or loss rate to catch silent failures. I set alert thresholds with runbooks, and I prefer ‘actionable’ alerts over noisy ones.”
Common mistake: Monitoring only offline metrics and ignoring data pipelines.
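For the “feature distribution drift” layer, one common lightweight check is the Population Stability Index (PSI) over binned feature values. The sketch below assumes you already have bin proportions from a baseline window and a current window; the 0.1/0.25 alert cutoffs often quoted for PSI are rules of thumb, not standards.

```python
from math import log

# Population Stability Index: sum over bins of (actual - expected) * ln(actual/expected).
def psi(expected, actual, eps=1e-6):
    """expected/actual: bin proportions from the baseline and current windows."""
    total = 0.0
    for e, a in zip(expected, actual):
        e = max(e, eps)   # guard against empty bins
        a = max(a, eps)
        total += (a - e) * log(a / e)
    return total

baseline = [0.25, 0.25, 0.25, 0.25]
shifted  = [0.10, 0.20, 0.30, 0.40]
# identical distributions give PSI ~ 0; the shift above lands around 0.23
```

Tying a check like this to a runbook (“PSI above X on feature Y: inspect upstream pipeline Z first”) is what turns a drift metric into an actionable alert rather than noise.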
Q: Describe how you would build a feature store or decide not to build one.
Why they ask it: This is an insider question—US teams care about reuse, consistency, and time-to-ship.
Answer framework: Build vs. buy vs. avoid: start with pain points, then governance and ROI.
Example answer: “If multiple teams repeatedly rebuild the same features and we have training/serving skew issues, a feature store can pay off. I’d start small: define a few high-value features with ownership, SLAs, and point-in-time correctness. If the org is small and models are few, I’d avoid heavy infrastructure and instead standardize feature pipelines with versioning and tests. The goal is consistency, not a shiny platform.”
Common mistake: Proposing a feature store as a default ‘maturity’ step.
Q: Which tools do you use for end-to-end work—data, modeling, and deployment—in the US market?
Why they ask it: They want practical familiarity with common stacks and how you choose tools.
Answer framework: Toolchain story: ingest → transform → train → deploy → monitor.
Example answer: “For data, I’m comfortable with SQL plus Snowflake or BigQuery, and I’ve used dbt for transformations. For modeling, I use Python with pandas, scikit-learn, and PyTorch when needed. For tracking, MLflow or Weights & Biases; for orchestration, Airflow; and for deployment, batch scoring via Spark or real-time via a containerized service on AWS. I choose based on latency needs, team skills, and operational support.”
Common mistake: Listing tools without showing where they fit in the lifecycle.
Q: How do you ensure your work aligns with US privacy expectations and regulations (e.g., CCPA/CPRA)?
Why they ask it: US companies—especially consumer tech—worry about privacy, consent, and data minimization.
Answer framework: Privacy-by-design: data minimization, purpose limitation, access controls, and documentation.
Example answer: “I start by confirming we have a legitimate purpose and the right consent for the data. I minimize sensitive fields, prefer aggregated or pseudonymized data, and document feature definitions and retention. I also partner with legal/privacy early when models touch targeting or personalization. If we can achieve the same outcome with less sensitive data, I’ll push for that.”
Common mistake: Treating privacy as a legal team problem only.
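The “prefer pseudonymized data” point can be made concrete with keyed hashing. This is a minimal sketch: the key value and its storage are assumptions (in practice the key lives in a secrets manager and is rotated per policy), and pseudonymization alone does not make data anonymous under CCPA/CPRA, it just reduces exposure.

```python
import hashlib
import hmac

# Keyed pseudonymization: stable mapping for joins across tables, but not
# reversible without the key. SECRET_KEY is a placeholder, not a real key.
SECRET_KEY = b"rotate-me-via-secrets-manager"

def pseudonymize(user_id: str) -> str:
    """Replace a raw identifier with an HMAC-SHA256 digest."""
    return hmac.new(SECRET_KEY, user_id.encode(), hashlib.sha256).hexdigest()
```

Using HMAC rather than a bare hash matters: unkeyed hashes of low-entropy identifiers (emails, phone numbers) can be reversed by brute force, so the secret key is doing the real privacy work here.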
Q: What would you do if your training pipeline suddenly fails the night before a launch decision?
Why they ask it: They’re testing incident response, prioritization, and whether you can deliver a decision under pressure.
Answer framework: Triage → Fallback → Communicate. Be explicit about what you’d ship and what you’d postpone.
Example answer: “First I’d identify whether it’s data freshness, credentials, schema drift, or compute. If the fix is risky, I’d use a fallback: last known good model, a simpler baseline, or a rules-based policy for the launch decision. I’d communicate clearly: what’s broken, what’s reliable, and the risk of proceeding. Then I’d schedule a postmortem and add tests so the failure becomes detectable earlier.”
Common mistake: Staying silent and trying to hero-fix without a fallback plan.
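The triage-then-fallback sequence above can be expressed as a small scoring wrapper. Everything here is hypothetical (the model callables, the `amount` rule, the 10,000 cutoff); the pattern, not the specifics, is the point: degrade from fresh model to last known good to a rules-based floor, and record which path produced each score so the launch decision is auditable.

```python
# Fallback scoring sketch: try the fresh model, then the last-known-good
# artifact, then a rules-based policy. Returns (path, score).
def score_with_fallback(features, fresh_model=None, last_good_model=None):
    for name, model in (("fresh", fresh_model), ("last_good", last_good_model)):
        if model is None:
            continue
        try:
            return name, model(features)
        except Exception:
            continue   # in a real system: log the failure and alert
    # Rules-based floor (hypothetical policy): flag obviously risky cases.
    return "rules", 1.0 if features.get("amount", 0) > 10_000 else 0.0
```

The returned path label is what lets you say, in the launch meeting, exactly which decisions were made by the degraded system and with what expected quality.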