Looking for accuracy benchmarks? Please download our Whitepaper
Benchmarked against container version 4.2.0

NER Benchmarks

The following section provides NER performance figures for Limina’s CPU and GPU containers on various VM instance types, including the hardware listed in the system requirements. These numbers were computed by generating load on the process/text route using the default settings (i.e., the HIGH_AUTOMATIC accuracy mode and heuristics coreference resolution). Requests to the process/text route were created from an internal dataset of English examples of varied length, and the load was scaled to the concurrency level that maximizes the throughput of the process/text endpoint. You can therefore expect lower latency under lighter load: a latency as low as 10ms can be achieved on a 100-word input when using a GPU deployment.
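As a rough illustration, a benchmark request to the process/text route might look like the sketch below. The request body fields (`text`, `accuracy_mode`, `coreference_resolution`) and the base URL are assumptions for illustration, not the documented API schema; check your deployment's API reference for the exact field names.

```python
import json
from urllib import request

def build_payload(text: str) -> dict:
    # Assumed request shape mirroring the defaults used for these benchmarks.
    return {
        "text": text,
        "accuracy_mode": "HIGH_AUTOMATIC",       # default accuracy mode
        "coreference_resolution": "heuristics",  # default coreference setting
    }

def process_text(base_url: str, text: str) -> dict:
    # POST the payload to the process/text route and decode the JSON response.
    req = request.Request(
        f"{base_url}/process/text",
        data=json.dumps(build_payload(text)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:
        return json.load(resp)
```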

NER Performance on CPU

The table below illustrates the performance of the CPU container on various instance types:
| Platform | Instance Type | Throughput¹ (words/sec) | Average Latency² (ms) |
|---|---|---|---|
| Azure | Standard_E2_v5 (2 vCPUs, 16GB RAM) | 513 | 1022 |
| Azure | Standard_E8_v5 (8 vCPUs, 64GB RAM) | 1719 | 304 |
| AWS | m7i.xlarge (4 vCPUs, 16GB RAM) | 834 | 628 |
| AWS | m7i.4xlarge (16 vCPUs, 64GB RAM) | 1843 | 285 |
¹ Throughput is given in words per second, where a word denotes a whitespace-separated piece of text.
² The average example length used for testing is 131 words. The values in this column are the average latency over all examples.

When using the STANDARD or STANDARD_MULTILINGUAL accuracy mode, you should expect a throughput around 4 to 5 times these numbers. Similarly, the STANDARD_HIGH and STANDARD_HIGH_MULTILINGUAL accuracy modes deliver a throughput around 3 times these numbers. Note that the coreference_resolution settings model_prediction and combined have a large impact on performance: expect throughput to drop by a factor of about 10 when these options are enabled. Similarly, SYNTHETIC entity replacement may reduce throughput by up to 50 times compared to MARKER replacement, depending on concurrency and the number of entities replaced.
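The scaling factors above can be turned into a quick back-of-the-envelope estimator. The multipliers below are the approximate midpoints of the ranges quoted in the notes, not measured values, and the helper function is a hypothetical sketch:

```python
# Approximate throughput multipliers relative to the HIGH_AUTOMATIC numbers
# in the table above (assumptions based on the quoted "4 to 5x" and "3x").
MODE_MULTIPLIER = {
    "HIGH_AUTOMATIC": 1.0,
    "STANDARD": 4.5,
    "STANDARD_MULTILINGUAL": 4.5,
    "STANDARD_HIGH": 3.0,
    "STANDARD_HIGH_MULTILINGUAL": 3.0,
}

def estimate_throughput(base_words_per_sec: float, mode: str) -> float:
    """Rough words/sec estimate for a given accuracy mode."""
    return base_words_per_sec * MODE_MULTIPLIER[mode]

# Example: Standard_E8_v5 measured 1719 words/sec in HIGH_AUTOMATIC mode,
# so STANDARD_HIGH should land near 3x that figure.
print(estimate_throughput(1719, "STANDARD_HIGH"))  # 5157.0
```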

NER Performance on GPU

The table below contains the benchmarks of the GPU container running on different instance types equipped with a single GPU. Note that the Limina GPU container is designed to run on a single GPU and will not leverage multiple GPUs.
| Platform | Instance Type | Throughput¹ (words/sec) | Average Latency² (ms) |
|---|---|---|---|
| Azure | Standard_NC4as_T4_v3 (4 vCPUs, 28GB RAM) | 11900 | 131 |
| Azure | Standard_NC8as_T4_v3 (8 vCPUs, 56GB RAM) | 11450 | 137 |
| AWS | g4dn.2xlarge (8 vCPUs, 32GB RAM) | 12100 | 538 |
| AWS | g4dn.4xlarge (16 vCPUs, 64GB RAM) | 14000 | 186 |
| AWS | g5.4xlarge (16 vCPUs, 64GB RAM) | 28200 | 453 |
¹ Throughput is given in words per second, where a word denotes a whitespace-separated piece of text.
² The average example length used for testing is 131 words. The values in this column are the average latency over all examples.

When using the STANDARD accuracy mode, you should expect a throughput around 4 times these numbers. Similarly, the STANDARD_HIGH accuracy mode delivers a throughput around 3 times these numbers. The model_prediction and combined coreference resolution modes are not recommended with the GPU container because of their large impact on throughput. The SYNTHETIC entity replacement option is likewise not recommended with the GPU container, as it largely negates the performance benefits of running on GPU.

PDF and Image Benchmark

Limina recommends processing documents on GPU instances. Below are benchmarks of the GPU container with all document features enabled, including the default OCR, object detection, and NER modes. Note that PDFs are processed as images, so processing one page of a PDF is roughly equivalent to processing one image. Note also that processing time for PDFs and images varies with image size, resolution, and the amount of text.
| Instance Type | Throughput (pages/sec) |
|---|---|
| g4dn.2xlarge | 1.41 |

Audio Benchmark

The throughput of Limina audio processing is provided below for the GPU and CPU images on common AWS instances.
| Instance Type | Image Type | Throughput (RTFx¹) |
|---|---|---|
| g4dn.2xlarge | GPU | ~23.0 |
| m7i.4xlarge | CPU | ~2.8 |
¹ RTFx (the inverse realtime factor) measures how many minutes of audio can be processed per minute of compute. An RTFx value of 30 means that an hour of audio is processed in 2 minutes. As the results above show, the GPU image is around 8 times faster than the CPU image (23.0 / 2.8 ≈ 8.2).

Additional Guidelines

Hardware

Hardware type matters. m5zn instances, powered by recent Intel Xeon CPUs with AVX512 VNNI support, perform over 3x faster than generic instances like c5. For this reason, it is recommended to use the hardware specified in the system requirements. It is likewise best to avoid AWS Fargate, which is typically provisioned with older CPUs comparable to those of c5 instances.

Scaling Considerations

Latency

The latency on the process/text endpoint scales approximately linearly with the request length. To reduce latency, call the process/text endpoint with smaller requests; in general this will not improve throughput. However, feeding the models very short inputs (a few words to a couple of sentences) may reduce accuracy because of the lack of context.
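One way to apply this is to split a long document into moderately sized chunks before sending them to process/text, keeping each chunk well above the few-words threshold where accuracy degrades. The chunking function and the 200-word default below are illustrative assumptions, not a documented recommendation:

```python
def chunk_text(text: str, max_words: int = 200) -> list[str]:
    """Split a document into whitespace-word chunks of at most max_words.

    Small enough to keep per-request latency low, large enough to
    preserve context for the models (an assumed trade-off point).
    """
    words = text.split()
    return [
        " ".join(words[i:i + max_words])
        for i in range(0, len(words), max_words)
    ]

# Each chunk can then be sent as a separate process/text request.
print(chunk_text("a b c d e", max_words=2))  # ['a b', 'c d', 'e']
```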

Throughput

To maximize throughput, it is recommended to use a large number of concurrent requests; batching smaller requests together does not improve throughput significantly. For very large deployments, GPU instances are recommended. A single low-cost inference instance such as the g4dn.2xlarge (~$0.752 USD per hour) can process 1GB of Unicode text in under an hour. Note that scaling GPU instances requires balancing GPU and CPU resources. As a rule of thumb, larger models like the high accuracy NER model are GPU bound, while smaller models like the standard accuracy NER model are CPU bound (on a GPU instance with too few CPU cores). It is best to experiment with a few configurations to find the one that best fits your data.
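A minimal sketch of the many-concurrent-requests pattern is shown below using a thread pool. The `send_request` stub stands in for a real HTTP POST to the process/text endpoint (the stub's return shape is an assumption for illustration); the concurrency level should be tuned against your deployment as described above.

```python
from concurrent.futures import ThreadPoolExecutor

def send_request(text: str) -> dict:
    # Placeholder for a real POST to the process/text endpoint; swap in your
    # HTTP client. Returning a stub dict keeps this sketch runnable offline.
    return {"words": len(text.split())}

def process_all(texts: list[str], concurrency: int = 32) -> list[dict]:
    # Keep many requests in flight at once; the benchmarks above were taken
    # at the concurrency level that saturated the endpoint.
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        return list(pool.map(send_request, texts))

results = process_all(["one two", "three four five"], concurrency=2)
print(results)  # [{'words': 2}, {'words': 3}]
```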

Kubernetes Deployments

You should expect slightly lower throughput and higher latency when running the Limina container on Kubernetes deployments using any of the instance types above. This is due to the overhead of the Kubernetes environment and to resources possibly being reserved for Kubernetes system processes.