# Associate Director, Software Engineering (Model Hosting/Inference Optimisation)

**Company**: HSBC
**Location**: Shenzhen, Guangdong, China / Guangzhou, Guangdong Province, China
**Work arrangement**: onsite
**Experience**: senior
**Job type**: full-time
**Category**: Engineering
**Industry**: Finance

**Apply**: https://portal.careers.hsbc.com/careers/job/563774611071743?utm_source=yubhub.co&utm_medium=jobs_feed&utm_campaign=apply
**Canonical**: https://yubhub.co/jobs/job_a41707dc-df0

## Description

We are seeking an experienced professional to join our team in the role of Associate Director, Software Engineering (Model Hosting/Inference Optimisation).

As a key member of our CTO Platforms (AI Platforms) team, you will design, build, and operate scalable, reliable model hosting platforms for LLMs, embeddings, and STT/TTS across heterogeneous hardware.

Key responsibilities include:

- Designing and implementing scalable model hosting platforms for LLMs, embeddings, and STT/TTS across heterogeneous hardware.

- Driving inference optimisation for latency, throughput, and cost through techniques such as quantisation, KV-cache optimisation, and dynamic/continuous batching.

- Evaluating, integrating, and tailoring inference frameworks (e.g., vLLM, TensorRT-LLM, SGLang) to maximise performance on target hardware.

- Owning inference health and performance monitoring (latency, throughput, TTFT, memory, and availability), and troubleshooting bottlenecks and deployment issues.

- Partnering with hardware teams to apply hardware-specific optimisations and improve resource utilisation.

- Ensuring hosting systems meet production standards for reliability, scalability, security, and high availability.

Requirements include:

- A Bachelor's, Master's, or PhD degree in ML, NLP, CS, Data Science, Statistics, or a related field.

- 3 years of experience working on AI platforms, covering both model hosting/inference optimisation and fine-tuning pipelines; LLM experience is strongly preferred.

- Strong engineering skills in Python and CUDA, with a solid understanding of GPU/CPU architecture and HPC fundamentals.

- Deep inference expertise, including KV-cache, batching, quantisation (INT4/FP8/GPTQ/AWQ), operator optimisation, and framework integration (vLLM, TensorRT-LLM, SGLang).

- End-to-end fine-tuning expertise, including data prep, distributed training, hyperparameter tuning, HF/Accelerate/LoRA/QLoRA, and benchmarking/monitoring/troubleshooting.

## Skills

### Required
- Python
- CUDA
- GPU/CPU architecture
- HPC fundamentals
- KV-cache
- batching
- quantisation
- operator optimisation
- framework integration
- data prep
- distributed training
- hyperparameter tuning
- HF/Accelerate/LoRA/QLoRA

---

Source: [Apply at portal.careers.hsbc.com](https://portal.careers.hsbc.com/careers/job/563774611071743?utm_source=yubhub.co&utm_medium=jobs_feed&utm_campaign=apply)
