
SGLang on ECS: efficiently serving leading OSS LLMs on AWS
Benefits of SGLang as a runtime environment for OSS LLMs. Sharing a Docker image to run it on ECS. Reasoning inference examples with Qwen QwQ-32B.

docker pull didierdurand/lic-sglang:amzn2023-latest
The image that we publish is built directly on GitHub via this GitHub Action. You can see the corresponding executions on this page.
- The image is based on Amazon Linux 2023, but it can be adapted to other flavors of Linux: Ubuntu, Red Hat, etc.
- The SGLang project is still in active development, with some features or parameters yet to be added. So, at build time we copy into the image a bash script, customize_sglang.sh, that allows for customization. For example, as of this writing, we update an HTTP timeout parameter in the source code via
sed 's/timeout_keep_alive=5,/timeout_keep_alive=500,/' -i $FILE
with FILE='/usr/local/lib/python3.12/site-packages/sglang/srt/entrypoints/http_server.py'. You can add your own customizations to this script.
- We also copy into the image a start_sglang.sh script that dynamically builds the SGLang start command from the env variables received from the
docker run
command. Different models have different requirements for the various parameters offered by SGLang, so keeping the launch parameters external to the image lets the same Docker image serve multiple LLMs.
- It is impractical to include the weights of the LLM in the image: they are most often too big (60 GB+ for QwQ-32B, for example) and they would tie the image to a specific LLM. A live pull from Hugging Face at each start would also take too long. So, we use a Docker bind mount at container start to link the
/home/model
directory of the image to an external directory of the host server, where the model weights are stored, in our case on a fast AWS EBS volume.
- We define multiple Docker ENV variables to collect the parameters required to issue the right start command for SGLang. Those variables are populated via the
--env
option (see the doc for details).
- The final
|| sleep infinity
is a trick to keep the container up and running even if the SGLang start command fails for any reason. It lets you connect to the container via
docker exec -it <container-id> /bin/bash
to debug the problem.
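The idea of building the start command from env variables can be sketched as follows. This is a minimal illustration, not the start_sglang.sh shipped in the image: the variable names (SGL_MODEL_PATH, SGL_TP_SIZE, SGL_PORT, SGL_EXTRA_ARGS) and the function name are hypothetical, and the sketch only prints the command instead of running it.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of a start script. The SGL_* variable names are
# assumptions made for this example, not the names used in the published image.

build_sglang_cmd() {
  # Defaults apply when the container is started without the matching --env.
  local model_path="${SGL_MODEL_PATH:-/home/model}"
  local tp_size="${SGL_TP_SIZE:-1}"
  local port="${SGL_PORT:-30000}"
  local extra="${SGL_EXTRA_ARGS:-}"

  # Assemble the launch command on a single line.
  local cmd="python3 -m sglang.launch_server --model-path ${model_path} --tp ${tp_size} --host 0.0.0.0 --port ${port}"
  if [ -n "${extra}" ]; then
    cmd="${cmd} ${extra}"
  fi
  printf '%s\n' "${cmd}"
}

# Dry run: print the command that would be launched with 4-way tensor parallelism.
SGL_TP_SIZE=4 build_sglang_cmd
```

In a real entrypoint, the printed command would be executed rather than echoed, followed by the || sleep infinity fallback described above, and each variable would be supplied through an --env option on docker run.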
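A source-level patch such as the timeout substitution above does nothing, and reports no error, if the upstream file changes and the pattern no longer matches. A guarded variant could look like the sketch below. It is an assumption about how one might harden such a customization, not the content of customize_sglang.sh, and it runs against a scratch file rather than the real http_server.py.

```shell
#!/usr/bin/env bash
# Sketch only: operates on a scratch file, not on the real http_server.py.
FILE="$(mktemp)"
printf '%s\n' '    timeout_keep_alive=5,' > "$FILE"

OLD='timeout_keep_alive=5,'
NEW='timeout_keep_alive=500,'

# Check for the expected text first: sed alone would exit successfully
# without changing anything if the upstream source had been refactored.
if grep -q "$OLD" "$FILE"; then
  sed -i "s/${OLD}/${NEW}/" "$FILE"
  echo "patched: $(grep -c "$NEW" "$FILE") line(s)"
else
  echo "pattern not found, nothing patched" >&2
fi
```

In an image build, the else branch could exit with a non-zero status so that the build fails visibly instead of shipping an unpatched server.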
Load weight end. type=Qwen2ForCausalLM, dtype=torch.bfloat16, avail mem=6.09 GB, mem usage=15.54 GB.
show that the total model (60 GB+) is split into 4 equal parts, one per GPU.
How many letters R in word 'strawberry' ?
- S
- T
- R
- A
- W
- B
- E
- R
- R
- Y
- The third letter is R.
- Then after E (7th), the next two letters are R and R. So that's two more. So total 1 + 2 = 3. So three R's. Hmm, but I've heard sometimes people might think there are two. Maybe a common mistake? Let me check again. Let me spell it slowly: S-T-R-A-W-B-E-R-R-Y. So after the B and E comes two R's in a row, then Y. So that's two R's after the E, plus the one after T. So total three. Therefore, the answer is 2? Wait no, three. Wait, maybe I'm confused. Let me count the Rs only:
- S – no
- T – no
- R – yes (1)
- A – no
- W – no
- B – no
- E – no
- R – yes (2)
- R – yes (3)
- Y – no
- S
- T
- R
- A
- W
- B
- E
- R
- R
- Y
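The letter-by-letter count in the reasoning above can be checked mechanically. A minimal sketch, independent of the model and of SGLang:

```shell
#!/usr/bin/env bash
word="strawberry"

# Delete every character except "r", then count the bytes that remain.
count=$(printf '%s' "$word" | tr -cd 'r' | wc -c | tr -d ' ')

echo "$count"   # prints 3
```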
A man has 53 socks in his drawer: 21 identical blue, 15 identical black and 17 identical red. The lights are out, and he is completely in the dark. How many socks must he take out to make 100 percent certain he has at least one pair of black socks?
- Blue socks: 21
- Red socks: 17
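The worst case behind this puzzle can be written as a small calculation: every blue and every red sock could be drawn before any black one, and only then do two further draws guarantee a black pair. A minimal sketch of that arithmetic, using the numbers from the prompt:

```shell
#!/usr/bin/env bash
blue=21
black=15
red=17

# Sanity check against the total stated in the prompt.
total=$(( blue + black + red ))

# Worst case: all non-black socks come out first, then two black ones.
needed=$(( blue + red + 2 ))

echo "total=$total needed=$needed"   # prints total=53 needed=40
```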
Any opinions in this post are those of the individual author and may not reflect the opinions of AWS.