# Benchmarks

> This page reports performance figures for the CPU and GPU containers, along with recommendations for maximizing throughput.

<Info>
  Looking for accuracy benchmarks? Please download our [Whitepaper](https://getlimina.ai/resources/whitepaper?utm_campaign=2024%20Content\&utm_source=docs\&utm_medium=docs\&utm_term=docs).
</Info>

Benchmarked against container version <Badge color="blue">4.2.0</Badge>

## NER Benchmarks

The following section provides some NER performance figures for Limina's CPU and GPU containers on various VM instance types, including the hardware in the [system requirements](/installation/prerequisites-and-system-requirements).

These numbers were computed by generating load on the `process/text` route with the default settings (i.e., `HIGH_AUTOMATIC` accuracy mode and `heuristics` coreference). Requests were created from an internal dataset of English examples of varied length. The load was scaled to the concurrency level that maximizes the throughput of the `process/text` endpoint, so you can expect lower latency under lighter load. Latency as low as 10 ms can be achieved on a 100-word input when using a GPU deployment.

### NER Performance on CPU

The table below illustrates the performance of the CPU container on various instance types:

<table id="benchmark-table" class="table">
  <thead>
    <tr>
      <th>Platform</th>
      <th>Instance Type</th>
      <th>Throughput<sup>1</sup> (words/sec)</th>
      <th>Average Latency<sup>2</sup> (ms)</th>
    </tr>
  </thead>

  <tbody>
    <tr>
      <th className="py-1.5 bg-gray-50 text-left" colspan="4">
        Azure
      </th>
    </tr>

    <tr>
      <td />

      <td> Standard\_E2\_v5 (2 vCPUs, 16GB RAM) </td>
      <td>513</td>
      <td>1022</td>
    </tr>

    <tr>
      <td />

      <td>Standard\_E8\_v5 (8 vCPUs, 64GB RAM)</td>
      <td>1719</td>
      <td>304</td>
    </tr>

    <tr>
      <th className="py-1.5 bg-gray-50 text-left" colspan="4">
        AWS
      </th>
    </tr>

    <tr>
      <td />

      <td> m7i.xlarge (4 vCPUs, 16GB RAM) </td>
      <td>834</td>
      <td>628</td>
    </tr>

    <tr>
      <td />

      <td>m7i.4xlarge (16 vCPUs, 64GB RAM)</td>
      <td>1843</td>
      <td>285</td>
    </tr>
  </tbody>
</table>

<sup>1</sup> Throughput is given in words per second, where a word denotes a whitespace-separated piece of text.

<sup>2</sup> The average example length used for the testing is 131 words. The values in this column are the average latency over all examples.

When using the `STANDARD` or `STANDARD_MULTILINGUAL` accuracy mode, you should expect a throughput that is around 4 to 5 times these numbers. Similarly, the `STANDARD_HIGH` and `STANDARD_HIGH_MULTILINGUAL` accuracy modes deliver a throughput that is around 3 times these numbers.

Note that the `coreference_resolution` settings `model_prediction` and `combined` have a large impact on performance. You can expect throughput to drop by roughly a factor of 10 if you enable these options.

Similarly, `SYNTHETIC` entity replacement may reduce throughput by a factor of up to 50 compared to `MARKER` replacement, depending on concurrency and the number of entities replaced.
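
As a rough planning aid, the accuracy-mode multipliers quoted above can be applied to a baseline figure from the table. The sketch below is only arithmetic on the numbers in this section; the function and dictionary names are hypothetical and are not part of the Limina API, and real throughput should always be confirmed by measurement.

```python
# Approximate multipliers quoted in this section, relative to the default
# HIGH_AUTOMATIC mode. Each entry is a (low, high) range.
# These names are illustrative only; they are not part of any Limina API.
CPU_MODE_MULTIPLIERS = {
    "HIGH_AUTOMATIC": (1.0, 1.0),
    "STANDARD": (4.0, 5.0),
    "STANDARD_MULTILINGUAL": (4.0, 5.0),
    "STANDARD_HIGH": (3.0, 3.0),
    "STANDARD_HIGH_MULTILINGUAL": (3.0, 3.0),
}


def estimate_cpu_throughput(baseline_wps, accuracy_mode="HIGH_AUTOMATIC"):
    """Return a (low, high) words/sec estimate for the given accuracy mode."""
    low, high = CPU_MODE_MULTIPLIERS[accuracy_mode]
    return baseline_wps * low, baseline_wps * high


# m7i.4xlarge baseline from the table: 1843 words/sec in HIGH_AUTOMATIC mode.
print(estimate_cpu_throughput(1843, "STANDARD"))  # (7372.0, 9215.0)
```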

### NER Performance on GPU

The table below contains the benchmarks of the GPU container running on different instance types equipped with a single GPU. Note that the Limina GPU container is designed to run on a single GPU and will not leverage multiple GPUs.

<table class="table">
  <thead>
    <tr>
      <th>Platform</th>
      <th>Instance Type</th>
      <th>Throughput<sup>1</sup> (words/sec)</th>
      <th>Average Latency<sup>2</sup> (ms)</th>
    </tr>
  </thead>

  <tbody>
    <tr>
      <th className="py-1.5 bg-tertiary-background" colspan="4">
        Azure
      </th>
    </tr>

    <tr>
      <td />

      <td>Standard\_NC4as\_T4\_v3 (4 vCPUs, 28GB RAM)</td>
      <td>11900</td>
      <td>131</td>
    </tr>

    <tr>
      <td />

      <td>Standard\_NC8as\_T4\_v3 (8 vCPUs, 56GB RAM)</td>
      <td>11450</td>
      <td>137</td>
    </tr>

    <tr>
      <th className="py-1.5 bg-tertiary-background" colspan="4">
        AWS
      </th>
    </tr>

    <tr>
      <td />

      <td>g4dn.2xlarge (8 vCPUs, 32GB RAM) </td>
      <td>12100</td>
      <td>538</td>
    </tr>

    <tr>
      <td />

      <td>g4dn.4xlarge (16 vCPUs, 64GB RAM)</td>
      <td>14000</td>
      <td>186</td>
    </tr>

    <tr>
      <td />

      <td>g5.4xlarge (16 vCPUs, 64GB RAM)</td>
      <td>28200</td>
      <td>453</td>
    </tr>
  </tbody>
</table>

<sup>1</sup> Throughput is given in words per second, where a word denotes a whitespace-separated piece of text.

<sup>2</sup> The average example length used for the testing is 131 words. The values in this column are the average latency over all examples.

When using the `STANDARD` accuracy mode, you should expect a throughput that is around 4 times these numbers. Similarly, the `STANDARD_HIGH` accuracy mode delivers a throughput that is around 3 times these numbers.

Using the `model_prediction` or `combined` coreference resolution modes with the GPU container is not recommended because of their large impact on throughput.

The `SYNTHETIC` entity replacement option is likewise not recommended with the GPU container, as it largely negates the performance benefits of running on GPU.
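
Because the load was scaled to maximize throughput, the latency column depends on how many requests were in flight, not only on the hardware. The sketch below uses Little's law (requests in flight = request rate × time per request) to estimate that number from the table. It assumes every request has the average length of 131 words; the actual concurrency settings used for the benchmark are not stated on this page, so treat the output as an estimate only.

```python
AVG_WORDS_PER_REQUEST = 131  # average example length from the table footnote


def implied_concurrency(throughput_wps, avg_latency_ms,
                        words_per_request=AVG_WORDS_PER_REQUEST):
    """Estimate requests in flight via Little's law (an approximation)."""
    requests_per_second = throughput_wps / words_per_request
    return requests_per_second * (avg_latency_ms / 1000.0)


# (throughput in words/sec, average latency in ms) taken from the GPU table.
GPU_ROWS = {
    "Standard_NC4as_T4_v3": (11900, 131),
    "g4dn.2xlarge": (12100, 538),
    "g4dn.4xlarge": (14000, 186),
}

for name, (wps, latency_ms) in GPU_ROWS.items():
    estimate = implied_concurrency(wps, latency_ms)
    print(f"{name}: ~{estimate:.0f} requests in flight")
```

Under these assumptions, rows with similar throughput but higher latency correspond to more requests in flight, which is consistent with the earlier note that lighter load yields lower latency.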

## PDF and Image Benchmark

Limina recommends that documents are processed on GPU instances. Below are benchmarks of the GPU container with all document features enabled. This includes the default OCR, object detection and NER modes.

Note that PDFs are processed as images, so processing one PDF page is roughly equivalent to processing one image.

Note also that the processing time for PDFs and images varies with the image size, its resolution, and the amount of text.

| Instance Type | Throughput (pages/sec) |
| ------------- | ---------------------- |
| g4dn.2xlarge  | 1.41                   |

## Audio Benchmark

The throughput of Limina audio processing is provided below for the GPU and CPU images on common AWS instance types.

| Instance Type | Image Type | Throughput (RTFx<sup>1</sup>) |
| ------------- | ---------- | ----------------------------- |
| g4dn.2xlarge  | GPU        | \~23.0                        |
| m7i.4xlarge   | CPU        | \~2.8                         |

<sup>1</sup> The RTFx (inverse real-time factor) measures how many minutes of audio can be processed in 1 minute. An RTFx value of 30 means that an hour of audio is processed in 2 minutes.

As the results above show, the GPU image is around 8 times faster than the CPU image.
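
To turn an RTFx value into a processing-time estimate, divide the audio duration by the RTFx. The helper below is a minimal sketch of that arithmetic using the approximate values from the table; the function name is hypothetical.

```python
def processing_minutes(audio_minutes, rtfx):
    """Estimated minutes needed to process `audio_minutes` of audio."""
    if rtfx <= 0:
        raise ValueError("rtfx must be positive")
    return audio_minutes / rtfx


# Approximate RTFx values from the table above.
for label, rtfx in [("GPU (g4dn.2xlarge)", 23.0), ("CPU (m7i.4xlarge)", 2.8)]:
    minutes = processing_minutes(60, rtfx)
    print(f"{label}: {minutes:.1f} min per hour of audio")
```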

## Additional Guidelines

### Hardware

Hardware type matters. `m5zn` instances powered by recent Intel Xeon CPUs with **AVX512 VNNI** support perform over 3X faster than generic instances like `c5`. For this reason, it is recommended to use the hardware specified in the [system requirements](/installation/prerequisites-and-system-requirements).

As such, it is best to avoid AWS Fargate, which is typically provisioned with older CPUs such as those found in `c5` instances.

### Scaling Considerations

#### Latency

The latency of the `process/text` endpoint scales approximately linearly with request length. To reduce latency, call the endpoint with smaller requests; in general, this will not improve throughput. However, feeding the models very short inputs (i.e., a few words to a couple of sentences) may reduce their accuracy because of the lack of context.
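
One way to act on this trade-off is to split long documents at sentence boundaries into moderately sized chunks, while avoiding chunks that are too short to carry context. The sketch below is an illustration only: the sentence splitter is deliberately naive, and the `max_words` and `min_words` defaults are arbitrary assumptions, not values recommended by Limina.

```python
import re


def chunk_text(text, max_words=100, min_words=20):
    """Split text into sentence-aligned chunks of roughly `max_words` words.

    A single sentence longer than `max_words` is kept whole, and a short
    final chunk is merged into the previous one, so chunks can exceed
    `max_words`.
    """
    # Naive sentence split: break after ., ! or ? followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

    chunks, current, count = [], [], 0
    for sentence in sentences:
        n_words = len(sentence.split())
        # Start a new chunk when adding this sentence would exceed the limit.
        if current and count + n_words > max_words:
            chunks.append(" ".join(current))
            current, count = [], 0
        current.append(sentence)
        count += n_words

    if current:
        tail = " ".join(current)
        # Merge a very short tail into the previous chunk to keep context.
        if chunks and count < min_words:
            chunks[-1] = chunks[-1] + " " + tail
        else:
            chunks.append(tail)
    return chunks
```

Each chunk can then be sent as its own request. Entities that depend on context from a neighbouring chunk may still be affected, so it is worth comparing results against unsplit input on a sample of your data.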

#### Throughput

To maximize throughput, it is recommended to use a large number of concurrent requests. Batching smaller requests together does not improve throughput significantly.
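
The following sketch shows one way to issue concurrent requests and measure the resulting throughput and average latency. It is not the harness used to produce the benchmarks above. The `send` callable is injected so the measurement logic is independent of the transport; `send_to_container`, its URL, and its JSON payload shape are assumptions for illustration and should be checked against the API reference (including any authentication your deployment requires).

```python
import json
import time
import urllib.request
from concurrent.futures import ThreadPoolExecutor


def measure_throughput(texts, send, concurrency=16):
    """Send every text through `send` using a pool of worker threads.

    Returns (words_per_second, average_latency_seconds).
    """
    if not texts:
        raise ValueError("texts must not be empty")

    def timed_call(text):
        start = time.perf_counter()
        send(text)
        return time.perf_counter() - start

    wall_start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        latencies = list(pool.map(timed_call, texts))
    wall_time = time.perf_counter() - wall_start

    # Words are whitespace-separated pieces of text, as in the tables above.
    total_words = sum(len(t.split()) for t in texts)
    return total_words / wall_time, sum(latencies) / len(latencies)


def send_to_container(text, url="http://localhost:8080/process/text"):
    # Hypothetical URL and payload shape, shown for illustration only.
    body = json.dumps({"text": [text]}).encode("utf-8")
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

A call such as `measure_throughput(texts, send_to_container, concurrency=32)` can be repeated at several concurrency levels to find the point where throughput stops improving.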

For very large deployments, GPU instances are recommended. A single low-cost inference instance such as the `g4dn.2xlarge` (\~\$0.752 USD per hour) can process 1 GB of Unicode text in under an hour.

Note that scaling GPU instances requires balancing GPU and CPU resources. As a rule of thumb, larger models like the `high` accuracy NER model are GPU bound, while smaller models like the `standard` accuracy NER model are CPU bound (on a GPU instance with too few CPU cores).

It is best to experiment with a few configurations to find the one that best fits your data.

### Kubernetes Deployments

You should expect slightly lower throughput and slightly higher latency when running the Limina container on Kubernetes using any of the instance types above. This is due to the overhead of the Kubernetes environment and to resources that may be reserved for Kubernetes processes.
