# Principal Software Engineer

**Company**: Microsoft
**Location**: Bengaluru
**Work arrangement**: onsite
**Experience**: senior
**Job type**: full-time
**Category**: Engineering
**Industry**: Technology
**Ticker**: MSFT
**Wikidata**: https://www.wikidata.org/wiki/Q2283

**Apply**: https://microsoft.ai/job/principal-software-engineer-68/?utm_source=yubhub.co&utm_medium=jobs_feed&utm_campaign=apply
**Canonical**: https://yubhub.co/jobs/job_ec7883bb-e73

## Description

Modern ads platforms run on always-on, real-time data: streaming events, feature computation, near-real-time aggregations, and low-latency serving to power ML models that operate at massive scale under strict freshness, cost, and reliability requirements.

Microsoft Ads builds and operates large-scale, latency-sensitive systems that serve billions of requests. We are looking for a Principal Software Engineer who is hands-on with production coding and system design to build the real-time data pipelines and feature/embedding materialization systems that feed online stores/caches and integrate tightly with ML inference serving.

This role is ideal for engineers who enjoy:

- building robust streaming + ETL systems (correctness, idempotency, backfills, late data)

- owning SLOs with strong observability and operational maturity

- optimizing end-to-end performance and cost across compute, storage, and serving integrations

Primary success metrics are freshness, correctness, latency, reliability, and cost in production.

### Responsibilities

- Design and implement real-time streaming ETL / feature pipelines (e.g., Flink or Spark Structured Streaming) that meet strict freshness and correctness constraints.

- Build and operate reliable messaging and ingestion with Kafka/Pulsar (partitioning strategy, retries, ordering guarantees, DLQs, backpressure handling).

- Own data contracts between producers, pipelines, and consumers: schema evolution, versioning, compatibility, validation, and safe rollout.

- Implement production-grade backfill/replay workflows.

- Define and meet SLOs using OpenTelemetry/Prometheus/Grafana for metrics, tracing, dashboards, alerting, and incident response readiness.

- Integrate pipelines with online stores/caches and ML consumers (feature stores, embedding pipelines, LLM API calls, online/offline consistency patterns).

- Partner with applied scientists on feature/embedding definitions, validation, and end-to-end quality measurement.

- Optimize end-to-end performance and efficiency: CPU/memory/I/O, serialization, caching, network overhead, concurrency, and pipeline compute cost.

- Contribute to serving/inference integrations where needed (e.g., Triton/ONNX Runtime/TensorRT) including batching and latency/cost tradeoffs.

- Ship safely with CI/CD, automated testing (unit/integration/data quality), and operational playbooks/runbooks.

### Qualifications

- Bachelor’s or Master’s degree in Computer Science, Electrical/Computer Engineering, or a related field, with 8+ years of related experience.

- Strong programming skills in at least one of C++, C#, or Python.

- Hands-on experience in one or more of the following:

  - Building and operating streaming data pipelines in production (Flink or Spark Structured Streaming)

  - Distributed systems engineering with strong reliability and operational rigor

  - Messaging systems such as Kafka/Pulsar

- Experience operating services with Kubernetes/containers and production readiness practices (deployments, scaling, rollbacks).

- Experience with observability stacks such as OpenTelemetry, Prometheus, Grafana.

- Ability to debug complex production issues using logs/metrics/traces and performance profiling.

- Strong communication and collaboration skills, with experience working across engineering, applied science/ML, and product/business stakeholders.

### Preferred Qualifications

- Experience with feature stores, embedding pipelines, and online/offline consistency (freshness guarantees, correctness validation).

- Experience with data lakehouse/table formats and optimizations, e.g., partitioning, compaction, and incremental processing.

- Experience with GPU inference serving (Triton, ONNX Runtime/TensorRT) and performance techniques (batching, request shaping, tail-latency reduction).

- Understanding of pipeline correctness patterns: idempotency, dedup, watermarking, late data, exactly-once vs at-least-once tradeoffs.

- Background in cost/performance modeling, capacity planning, and reliability improvements for high-scale data platforms.

- Experience in Ads/search/recommendations or other high-scale systems where freshness, latency, and cost are jointly optimized.

## Skills

### Required
- C++
- C#
- Python
- Flink
- Spark Structured Streaming
- Kafka
- Pulsar
- OpenTelemetry
- Prometheus
- Grafana
- Kubernetes
- Distributed systems engineering
- Messaging systems

### Nice to have
- Feature stores
- Embedding pipelines
- Online/offline consistency
- Data lakehouse/table formats
- GPU inference serving
- Pipeline correctness patterns
- Cost/performance modeling
- Capacity planning
- Reliability improvements

---

Source: [Apply at microsoft.ai](https://microsoft.ai/job/principal-software-engineer-68/?utm_source=yubhub.co&utm_medium=jobs_feed&utm_campaign=apply)
