# Benchmarks

> Benchmarks provides performance numbers for the CPU and GPU containers, including recommendations for best throughput.

Looking for accuracy benchmarks? Please download our [Whitepaper](https://getlimina.ai/resources/whitepaper?utm_campaign=2024%20Content\&utm_source=docs\&utm_medium=docs\&utm_term=docs).

Benchmarked against container version 4.2.0.

## NER Benchmarks

The following section provides NER performance figures for Limina's CPU and GPU containers on various VM instance types, including the hardware in the [system requirements](/installation/prerequisites-and-system-requirements).

These numbers were computed by generating load on the `process/text` route using the default settings (i.e., `HIGH_AUTOMATIC` accuracy mode and `heuristics` coreference). Requests to the `process/text` route were created using an internal dataset of English examples of varied length.

The load was scaled to the concurrency level that maximizes the throughput of the `process/text` endpoint. You can therefore expect lower latency under a lighter load. A latency as low as 10 ms can be achieved on a 100-word input when using a GPU deployment.

### NER Performance on CPU

The table below illustrates the performance of the CPU container on various instance types:

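| Platform | Instance Type                         | Throughput¹ (words/sec) | Average Latency² (ms) |
| -------- | ------------------------------------- | ----------------------- | --------------------- |
| Azure    | Standard\_E2\_v5 (2 vCPUs, 16GB RAM)  | 513                     | 1022                  |
| Azure    | Standard\_E8\_v5 (8 vCPUs, 64GB RAM)  | 1719                    | 304                   |
| AWS      | m7i.xlarge (4 vCPUs, 16GB RAM)        | 834                     | 628                   |
| AWS      | m7i.4xlarge (16 vCPUs, 64GB RAM)      | 1843                    | 285                   |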
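
As a rough illustration of how these throughput figures can feed capacity planning, the sketch below converts a word count into an estimated processing time. The function names are hypothetical and not part of any Limina SDK. The figures are copied from the CPU table, which was measured under saturating load with default settings, so treat the result as a ballpark rather than a guarantee.

```python
# Throughput figures (words/sec) copied from the CPU table in this section.
CPU_THROUGHPUT_WPS = {
    "Standard_E2_v5": 513,
    "Standard_E8_v5": 1719,
    "m7i.xlarge": 834,
    "m7i.4xlarge": 1843,
}


def count_words(text: str) -> int:
    """Count whitespace-separated pieces of text (the benchmark's definition of a word)."""
    return len(text.split())


def estimate_seconds(total_words: int, instance_type: str) -> float:
    """Estimate processing time at the benchmarked throughput (hypothetical helper)."""
    return total_words / CPU_THROUGHPUT_WPS[instance_type]


# Example: estimate the time for a one-million-word corpus on each instance type.
for name in CPU_THROUGHPUT_WPS:
    print(f"{name}: {estimate_seconds(1_000_000, name) / 60:.1f} min")
```

Real workloads may differ from this estimate depending on text length distribution, accuracy mode, and achieved concurrency.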

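### NER Performance on GPU

The table below illustrates the performance of the GPU container on various instance types:

| Platform | Instance Type                                | Throughput¹ (words/sec) | Average Latency² (ms) |
| -------- | -------------------------------------------- | ----------------------- | --------------------- |
| Azure    | Standard\_NC4as\_T4\_v3 (4 vCPUs, 28GB RAM)  | 11900                   | 131                   |
| Azure    | Standard\_NC8as\_T4\_v3 (8 vCPUs, 56GB RAM)  | 11450                   | 137                   |
| AWS      | g4dn.2xlarge (8 vCPUs, 32GB RAM)             | 12100                   | 538                   |
| AWS      | g4dn.4xlarge (16 vCPUs, 64GB RAM)            | 14000                   | 186                   |
| AWS      | g5.4xlarge (16 vCPUs, 64GB RAM)              | 28200                   | 453                   |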

¹ Throughput is given in words per second, where a word denotes a whitespace-separated piece of text.

² The average example length used for the testing is 131 words. The values in this column are the average latency over all examples.

When using the `STANDARD` accuracy mode, you should expect a throughput that is around 4 times these numbers. Similarly, the `STANDARD_HIGH` accuracy mode will deliver a throughput that is 3 times these numbers.

It is not recommended to use the `model_prediction` or `combined` coreference resolution modes with the GPU container because of the large impact these modes have on throughput. The `SYNTHETIC` entity replacement option is likewise not recommended with the GPU container, as it largely negates the performance benefits of running on GPU.

## PDF and Image Benchmark

Limina recommends processing documents on GPU instances. Below are benchmarks of the GPU container with all document features enabled. This includes the default OCR, object detection and NER modes.

Note that PDFs are processed as images, so processing one PDF page is roughly equivalent to processing one image. Note also that the processing time of PDFs and images may vary depending on the image size, its resolution and the amount of text.

| Instance Type | Throughput (pages/sec) |
| ------------- | ---------------------- |
| g4dn.2xlarge  | 1.41                   |

## Audio Benchmark

The throughput of Limina audio processing is provided below for the GPU and CPU images on common AWS instances.

| Instance Type | Image Type | Throughput (RTFx¹) |
| ------------- | ---------- | ------------------ |
| g4dn.2xlarge  | GPU        | \~23.0             |
| m7i.4xlarge   | CPU        | \~2.8              |

¹ The RTFx (inverse real-time factor) measures how many minutes of audio can be processed in 1 minute. An RTFx value of 30 means that an hour of audio recording is processed in 2 minutes.

As the results above show, the GPU image is around 8 times faster than the CPU image (23.0 / 2.8 ≈ 8.2).

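
The RTFx definition above amounts to a simple division: processing time equals audio duration divided by RTFx. The helper below is a minimal sketch of that arithmetic; the function name is hypothetical and only illustrates the conversion.

```python
def processing_minutes(audio_minutes: float, rtfx: float) -> float:
    """Convert an audio duration into processing time using an RTFx value.

    RTFx is the number of minutes of audio processed per minute of wall-clock time.
    """
    if rtfx <= 0:
        raise ValueError("rtfx must be positive")
    return audio_minutes / rtfx


# One hour of audio at RTFx 30 takes 2 minutes, as in the footnote's example.
print(processing_minutes(60, 30))

# The same hour of audio at the approximate RTFx values from the table.
print(f"GPU (g4dn.2xlarge): {processing_minutes(60, 23.0):.1f} min")
print(f"CPU (m7i.4xlarge): {processing_minutes(60, 2.8):.1f} min")
```

Because the table values are approximate, the resulting times are estimates as well.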
## Additional Guidelines

### Hardware

Hardware type matters. `m5zn` instances powered by recent Intel Xeon CPUs with **AVX512 VNNI** support perform over 3X faster than generic instances like `c5`. For this reason, it is recommended to use the hardware specified in the [system requirements](/installation/prerequisites-and-system-requirements). For the same reason, it is best to avoid AWS Fargate, which is typically provisioned with older CPUs like those in `c5` instances.

### Scaling Considerations

#### Latency

The latency on the `process/text` endpoint scales approximately linearly with the request length. To reduce latency, call the `process/text` endpoint with smaller requests. This, in general, will not improve throughput. Note also that feeding the models very short inputs (i.e., a few words to a couple of sentences) may reduce the models' accuracy because of the lack of context.

#### Throughput

To maximize throughput, it is recommended to use a large number of concurrent requests. Batching smaller requests together does not improve throughput significantly.

For very large deployments, GPU instances are recommended. A single low-cost inference instance such as the `g4dn.2xlarge` (\~\$0.752 USD per hour) can process 1GB of Unicode text in under an hour.

Note that scaling GPU instances requires balancing the GPU and CPU resources. As a rule of thumb, larger models like the `high` accuracy NER model are GPU bound, while smaller models like the `standard` accuracy NER model are CPU bound (on a GPU instance with too few CPU cores). It is best to experiment with a few configurations to find the one that best fits your data.

### Kubernetes Deployments

You should expect slightly lower throughput and slightly higher latency when running the Limina container on Kubernetes deployments using any of the instance types above. This is due to the overhead of the Kubernetes environment and to resources possibly being reserved for the Kubernetes processes.
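
The Throughput recommendation above (many concurrent requests rather than larger batches) can be sketched with a thread pool. This is a minimal sketch under stated assumptions, not an official client: `BASE_URL`, the JSON payload shape in `post_text`, and both function names are hypothetical and should be checked against the API reference for your container version. The sender is passed in as a parameter so the measurement logic can be tried without a running container.

```python
import json
import time
import urllib.request
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Sequence

# Hypothetical address of a locally running container; adjust to your deployment.
BASE_URL = "http://localhost:8080"


def post_text(text: str) -> bytes:
    """Send one request to the process/text route (payload shape is an assumption)."""
    body = json.dumps({"text": [text]}).encode("utf-8")
    request = urllib.request.Request(
        f"{BASE_URL}/process/text",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=60) as response:
        return response.read()


def measure_throughput(
    texts: Sequence[str],
    send: Callable[[str], object],
    concurrency: int,
) -> float:
    """Send every text using `concurrency` parallel workers and return words/sec."""
    total_words = sum(len(t.split()) for t in texts)
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        # list() waits for all requests and re-raises any exception from a worker.
        list(pool.map(send, texts))
    elapsed = time.perf_counter() - start
    return total_words / elapsed
```

Calling `measure_throughput(texts, post_text, concurrency=n)` for a few values of `n` is one way to find the concurrency level at which throughput stops improving on a given instance.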