16.03.2025
Edgestar Inference Performance Test: Ollama Inference Server with Llama3.1, Qwen2.5, and Deepseek-R1

As the application of large language models (LLMs) in enterprises grows, selecting the right model is crucial for achieving high performance and reducing costs. Spingence’s Edgestar platform, as a high-performance edge inference solution, helps enterprises deploy and run various LLMs locally, ensuring exceptional performance across different use cases.
This article delves into the inference performance of the Edgestar platform using Ollama across different language models (Llama3.1, Qwen2.5, Deepseek-R1) and tests their performance under varying concurrent request levels. It aims to help you understand how to select the most suitable model to enhance your enterprise applications.
Test Environment and Conditions

LLM Inference Efficiency Metrics
When conducting inference tests, we focus on three key performance indicators:


- TTFT (Time to First Token): The time taken from sending the request to receiving the first token, which is a key indicator of model response speed.
- End-to-End Latency: The total time taken from the start of the request to the completion of the entire inference process, reflecting the model's overall processing delay.
- TPS (Tokens per Second): The number of tokens generated per second, which is a crucial metric for evaluating the system’s processing capability and throughput.
These metrics help assess the performance of different types of LLMs in a hardware environment and understand their operational efficiency in real-world applications.
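To make these metrics concrete, the following is a minimal sketch of how TTFT, end-to-end latency, and TPS can be measured against a local Ollama server through its streaming /api/generate endpoint. The model tag, prompt, and default port are assumptions for illustration; this is not the exact harness used to produce the results in this article.

```python
# Minimal sketch: measure TTFT, end-to-end latency, and TPS for one request
# against a local Ollama server. Assumes Ollama listens on its default port
# (11434) and that the model tag below has already been pulled.
import json
import time

import requests

OLLAMA_URL = "http://localhost:11434/api/generate"  # default Ollama endpoint
MODEL = "llama3.1:8b"                                # assumed model tag
PROMPT = "Explain edge inference in one paragraph."


def benchmark_single_request():
    start = time.perf_counter()
    ttft = None
    tokens = 0
    # With "stream": true, Ollama returns newline-delimited JSON chunks.
    with requests.post(
        OLLAMA_URL,
        json={"model": MODEL, "prompt": PROMPT, "stream": True},
        stream=True,
        timeout=300,
    ) as resp:
        resp.raise_for_status()
        for line in resp.iter_lines():
            if not line:
                continue
            chunk = json.loads(line)
            if ttft is None and chunk.get("response"):
                ttft = time.perf_counter() - start   # time to first token
            if chunk.get("done"):
                # The final chunk reports the generated-token count.
                tokens = chunk.get("eval_count", tokens)
                break
            tokens += 1  # fallback: roughly one token per streamed chunk
    latency = time.perf_counter() - start            # end-to-end latency
    tps = tokens / latency if latency > 0 else 0.0
    return {"ttft_s": ttft, "latency_s": latency, "tps": tps}


if __name__ == "__main__":
    print(benchmark_single_request())
```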
LLM Inference Efficiency with Single Requests
Based on parameter scale, the tested LLMs are divided into three categories: Lightweight (less than 10B parameters), Medium-Scale (less than 50B parameters), and Heavyweight (greater than 70B parameters).


The test results show that when there is only a single request, the model’s parameter scale has a significant impact on inference efficiency. Generally, lightweight models perform faster with lower TTFT and latency, and higher TPS.
In contrast, heavyweight models require more computational resources, resulting in higher latency and lower TPS due to the increased number of parameters and more complex calculations.
Impact of Concurrent Requests on LLM Inference Performance
In enterprise applications, systems must often handle not just single requests but concurrent requests from multiple users. For instance, a smart customer service system may receive concurrent requests from dozens or hundreds of users, so it must remain stable and efficient when handling high-concurrency inference tasks.
Therefore, the test includes varying concurrent request numbers (1/2/4/8/16/32/64/128) to simulate real business environments and measure the performance of each model under high load.
The following test results reveal each model's base performance under different concurrent requests and compare how models handle higher loads as the number of concurrent requests increases.
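As a rough illustration of this kind of concurrency sweep, the sketch below issues 1/2/4/.../128 simultaneous requests to the same Ollama endpoint with a thread pool and reports aggregate throughput. The endpoint, model tag, and prompt are assumptions; the actual test harness behind the numbers in this article is not shown here.

```python
# Minimal sketch: sweep concurrency levels against a local Ollama server and
# report aggregate tokens per second at each level.
import time
from concurrent.futures import ThreadPoolExecutor

import requests

OLLAMA_URL = "http://localhost:11434/api/generate"  # default Ollama endpoint
MODEL = "qwen2.5:7b"                                 # assumed model tag
PROMPT = "Summarize the benefits of local LLM inference."


def one_request() -> int:
    """Send one non-streaming request and return the generated-token count."""
    resp = requests.post(
        OLLAMA_URL,
        json={"model": MODEL, "prompt": PROMPT, "stream": False},
        timeout=600,
    )
    resp.raise_for_status()
    # eval_count is the generated-token count reported by Ollama.
    return resp.json().get("eval_count", 0)


def sweep(levels=(1, 2, 4, 8, 16, 32, 64, 128)):
    for n in levels:
        start = time.perf_counter()
        with ThreadPoolExecutor(max_workers=n) as pool:
            token_counts = list(pool.map(lambda _: one_request(), range(n)))
        elapsed = time.perf_counter() - start
        tps = sum(token_counts) / elapsed
        print(f"concurrency={n:3d}  elapsed={elapsed:6.1f}s  aggregate TPS={tps:6.1f}")


if __name__ == "__main__":
    sweep()
```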
1. TTFT (Time to First Token)


2. End-to-End Latency

3. Tokens per Second (TPS)



- Lightweight Models (less than 10B parameters):
- With low request numbers (such as 1 and 2 requests), lightweight models perform excellently, maintaining low TTFT and latency, and stable TPS.
- Even when the number of requests reaches 64 or 128, they can still maintain high TPS, showing good scalability.
- Ideal for applications that require fast responses and handle fewer concurrent requests, such as smart customer service or basic text classification.
- Medium-Scale Models (less than 50B parameters):
- Medium-scale models perform slightly worse than lightweight models with low request numbers but show performance fluctuations as the request number increases. TTFT and latency significantly increase, especially at 64 and 128 requests, where the processing ability notably declines.
- Suitable for handling moderately complex tasks, such as knowledge base queries and sentiment analysis, where inference efficiency is essential.
- Heavyweight Models (greater than 70B parameters):
- Heavyweight models remain stable with low request numbers, but as the number of requests increases, end-to-end latency rises significantly, TTFT grows, and TPS drops, making them less suitable for high-concurrency workloads.
- Ideal for applications requiring extremely high model accuracy, offering precise results despite higher inference delays, and suited for high-end data analysis, large-scale content generation, and other high-demand use cases.
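Purely as an illustration of this guidance, the small sketch below encodes the three tiers as a simple selection helper. The thresholds and tier names mirror the categories above; the function itself is a hypothetical example, not part of Edgestar or Ollama.

```python
# Illustrative-only sketch: map application requirements to a model tier,
# following the categories described in this article.
def pick_model_tier(max_concurrency: int, needs_high_accuracy: bool) -> str:
    if needs_high_accuracy:
        # Heavyweight models trade latency for precision.
        return "heavyweight (>70B): accept higher latency for maximum accuracy"
    if max_concurrency >= 64:
        # Lightweight models kept TPS stable even at 64-128 concurrent requests.
        return "lightweight (<10B): stable TPS under high concurrency"
    return "medium-scale (<50B): balanced quality and throughput at moderate load"


print(pick_model_tier(max_concurrency=128, needs_high_accuracy=False))
```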
As enterprises increasingly rely on LLMs for business innovation, selecting the right inference solution becomes crucial.
Spingence's Edgestar platform, with its robust hardware configuration and efficient inference capabilities, helps businesses achieve stable and efficient operation when deploying LLMs locally, whether dealing with low or high concurrency workloads.
Choose Edgestar to give your enterprise a sustained performance advantage in the application of large language models.🚀