# Named Entity Recognition

> This guide explains how to detect PII in text and files without redaction

<Info>
  [Connect with one of our privacy experts](https://getlimina.ai/contact-us/?utm_source=docs\&utm_medium=website) to run this code.
</Info>

In addition to de-identification and redaction, Limina also supports entity detection. This is useful for data discovery, and it also allows Limina to be used as a general-purpose Named Entity Recognition (NER) engine. In this guide we demonstrate how to use the [`ner/text`](/latest/ner-text) endpoint, introduced in `3.9`, to return the entities found in text, and we describe an approach for doing the same with files.

## Detect entities in text <Badge color="blue">(new in 3.9)</Badge>

The [`ner/text`](/latest/ner-text) route introduced in 3.9 returns a list of detected entities. It can be thought of as a cut-down version of [`process/text`](/latest/process-text) that only returns the list of detected entities, [with a key difference described in the next section](#process-vs-detect-entities). In this snippet we use our Python SDK to invoke the [`ner/text`](/latest/ner-text) route on a short sentence and to return a list of detected entities:

```python Text NER wrap lines theme={"theme":"poimandres"}
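# Assumes `client` and `request_objects` have already been set up from the Python SDK.
# Build a request containing the text to analyse, then call the ner/text route.
text_request = request_objects.ner_text_obj(text=["My sample name is John Smith"])
resp = client.ner_text(text_request)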
```

The list of detected entities is found in the `entities` field:

```python Text NER theme={"theme":"poimandres"}
import json

print(json.dumps(resp.entities, indent=4))
```

Yields:

```json NER Response wrap lines theme={"theme":"poimandres"}
[
  [
    {
      "text": "John Smith",
      "location": {
        "stt_idx": 18,
        "end_idx": 28
      },
      "label": "NAME",
      "likelihood": 0.9105876684188843
    },
    {
      "text": "John",
      "location": {
        "stt_idx": 18,
        "end_idx": 22
      },
      "label": "NAME_GIVEN",
      "likelihood": 0.9043319821357727
    },
    {
      "text": "Smith",
      "location": {
        "stt_idx": 23,
        "end_idx": 28
      },
      "label": "NAME_FAMILY",
      "likelihood": 0.9326320886611938
    }
  ]
]
```
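The `location` offsets can be used to recover each entity from the original input. The short sketch below assumes, based only on the sample response above, that `stt_idx` and `end_idx` are zero-based character offsets with an exclusive end. It works on plain dictionaries copied from that response rather than on SDK objects, and `entity_span` is a hypothetical helper name.

```python
# Input text and a trimmed copy of the sample ner/text response shown above.
text = "My sample name is John Smith"
entities = [
    {"text": "John Smith", "location": {"stt_idx": 18, "end_idx": 28}, "label": "NAME"},
    {"text": "John", "location": {"stt_idx": 18, "end_idx": 22}, "label": "NAME_GIVEN"},
    {"text": "Smith", "location": {"stt_idx": 23, "end_idx": 28}, "label": "NAME_FAMILY"},
]


def entity_span(source: str, entity: dict) -> str:
    """Slice the source text using the entity's character offsets.

    Assumption: offsets are zero-based and end_idx is exclusive.
    """
    loc = entity["location"]
    return source[loc["stt_idx"]:loc["end_idx"]]


for entity in entities:
    # Each slice should match the entity's own "text" field.
    print(entity["label"], repr(entity_span(text, entity)))
```

All three entities start at or after index 18, which shows how the overlapping detections share the same region of the input.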

## Process vs Detect Entities

There is a key difference between the entities returned by the [`process/text`](/latest/process-text) route and those returned by the [`ner/text`](/latest/ner-text) route: [`process/text`](/latest/process-text) groups overlapping entity detections into a single entity object, while [`ner/text`](/latest/ner-text) does not. This is evident from the previous example, where *John Smith* was detected as three different entities: `John Smith`, `John` and `Smith`. The corresponding [`process/text`](/latest/process-text) entity list is:

```json Process Text Json Object theme={"theme":"poimandres"}
[
  {
    "processed_text": "NAME_1",
    "text": "John Smith",
    "location": {
      "stt_idx": 18,
      "end_idx": 28,
      "stt_idx_processed": 18,
      "end_idx_processed": 26
    },
    "best_label": "NAME",
    "labels": {
      "NAME": 0.9106,
      "NAME_GIVEN": 0.4522,
      "NAME_FAMILY": 0.4663
    }
  }
]
```

The [`ner/text`](/latest/ner-text) route provides the raw output of the entity detection engine. It is recommended when you need details about all entities discovered in a text fragment, including overlapping ones. With the [`ner/text`](/latest/ner-text) route you can answer questions like *Does this text contain zip codes?* or *Does it contain a complete address?* This extra flexibility means that you should be ready to implement your own post-processing logic.
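As an illustration of such post-processing, here is a minimal sketch that answers "does this text contain label X?" from the raw entity list. It operates on plain dictionaries shaped like the sample `ner/text` response above. The helper names and the `min_likelihood` threshold are assumptions made for this example, not part of the SDK, and the same pattern would apply to other labels (check the supported entity list for the exact label names).

```python
from typing import Any, Dict, List

# Entities for one input text, copied from the sample ner/text response above.
sample_entities: List[Dict[str, Any]] = [
    {"text": "John Smith", "label": "NAME", "likelihood": 0.9105876684188843},
    {"text": "John", "label": "NAME_GIVEN", "likelihood": 0.9043319821357727},
    {"text": "Smith", "label": "NAME_FAMILY", "likelihood": 0.9326320886611938},
]


def contains_label(entities: List[Dict[str, Any]], label: str,
                   min_likelihood: float = 0.5) -> bool:
    """Return True if any entity has the given label at or above the threshold.

    The 0.5 default is an arbitrary illustrative threshold.
    """
    return any(
        e["label"] == label and e["likelihood"] >= min_likelihood
        for e in entities
    )


def has_full_name(entities: List[Dict[str, Any]]) -> bool:
    """Hypothetical composite check: both a given and a family name are present."""
    return contains_label(entities, "NAME_GIVEN") and contains_label(entities, "NAME_FAMILY")


print(contains_label(sample_entities, "NAME_FAMILY"))                      # True
print(contains_label(sample_entities, "NAME_GIVEN", min_likelihood=0.95))  # False
print(has_full_name(sample_entities))                                      # True
```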

Use the [`process/text`](/latest/process-text) route if you need non-overlapping logical entities, e.g. to count the number of detected entities.
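Counting logical entities from grouped output could look like the following sketch. It assumes the entities are available as plain dictionaries with a `best_label` key, as in the `process/text` sample above; `count_by_best_label` is a hypothetical helper name.

```python
from collections import Counter
from typing import Any, Dict, Iterable


def count_by_best_label(entities: Iterable[Dict[str, Any]]) -> Counter:
    """Count grouped entities per best_label; each grouped entity counts once."""
    return Counter(entity["best_label"] for entity in entities)


# Trimmed copy of the grouped process/text entity shown above.
grouped_entities = [
    {"processed_text": "NAME_1", "text": "John Smith", "best_label": "NAME"},
]

counts = count_by_best_label(grouped_entities)
print(sum(counts.values()), dict(counts))  # 1 {'NAME': 1}
```

Running the same count over the raw `ner/text` list for this sentence would report three entities, because the overlapping detections are not grouped there.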

## Detect entities in files

While the [`ner/text`](/latest/ner-text) route only supports text at this time, it is still possible to achieve similar behaviour for files, with one caveat related to the previous section: only grouped entities are accessible for files.

In this snippet we use our Python SDK to process a file submitted as a `base64`-encoded string.

```python Processing File Via Base64 Route theme={"theme":"poimandres"}
import base64

# Read the file and base64-encode its contents
with open('./sample_pdfs/Letter-of-Intent-pdf.pdf', "rb") as file:
    b64_file_data = base64.b64encode(file.read()).decode("ascii")

# Make the request
file_obj = request_objects.file_obj(data=b64_file_data, content_type='application/pdf')
request_obj = request_objects.file_base64_obj(file=file_obj)
resp = client.process_files_base64(request_object=request_obj)
```

Here again, we simply take the `.entities` object from the API response and add it to a dictionary, with the original file path stored under the `path` key. In this case we create a single dictionary that maps the file to its entities. To process an entire directory of files, you can build a list in which each element is such a dictionary, or emit each dictionary to a datastore of your choosing.

```python File NER wrap lines highlight={5-6} theme={"theme":"poimandres"}
import json
from typing import Any, Dict, List

ner_objects: List[Dict[str, Any]] = []
ner_object = dict(path="./sample_pdfs/Letter-of-Intent-pdf.pdf", entities=resp.entities)
ner_objects.append(ner_object)
print(json.dumps(ner_objects, indent=4))
```

Now we have a clean list of dictionaries, each holding the PII detected in a file together with the file location for further inspection if necessary.

```json File NER JSON Response wrap lines theme={"theme":"poimandres"}
[
  {
    "path": "./sample_pdfs/Letter-of-Intent-pdf.pdf",
    "entities": [
      {
        "processed_text": "NAME_1",
        "text": "Sarah Jackson",
        "location": {
          "page": 1,
          "x0": 0.11588,
          "x1": 0.23794,
          "y0": 0.20727,
          "y1": 0.22227
        },
        "best_label": "NAME",
        "labels": {
          "NAME": 0.9185,
          "NAME_GIVEN": 0.4492,
          "NAME_FAMILY": 0.4675
        }
      },
      {
        "processed_text": "ORGANIZATION_2",
        "text": "Best Capital Corp",
        "location": {
          "page": 3,
          "x0": 0.11706,
          "x1": 0.27,
          "y0": 0.87,
          "y1": 0.88909
        },
        "best_label": "ORGANIZATION",
        "labels": {
          "ORGANIZATION": 0.8789
        }
      }
    ]
  }
]
```

Note how the text *Sarah Jackson* resulted in a single *grouped* entity, instead of the three separate entities that [`ner/text`](/latest/ner-text) returned for *John Smith* in the earlier example.

In this case we have kept the full list of entity dictionaries for simplicity, but in your own implementation you can keep only the fields you need.
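The two ideas above, looping over several files and keeping only selected fields, can be sketched together as follows. This is an illustration under stated assumptions: `collect_entities` and its `keep` parameter are hypothetical, and `detect` stands for any callable you write that wraps the base64 request shown earlier and returns the entities as plain dictionaries. A fake `detect` is used here so that the example runs without access to the API.

```python
import json
from typing import Any, Callable, Dict, Iterable, List, Sequence

Entity = Dict[str, Any]


def collect_entities(
    paths: Iterable[str],
    detect: Callable[[str], List[Entity]],
    keep: Sequence[str] = ("text", "best_label"),
) -> List[Dict[str, Any]]:
    """Build one {"path", "entities"} dictionary per file, keeping selected fields."""
    results: List[Dict[str, Any]] = []
    for path in paths:
        entities = detect(path)
        # Drop every field that is not listed in `keep`.
        trimmed = [{key: e[key] for key in keep if key in e} for e in entities]
        results.append({"path": path, "entities": trimmed})
    return results


def fake_detect(path: str) -> List[Entity]:
    """Stand-in for the real SDK call; returns data shaped like the response above."""
    return [
        {
            "processed_text": "NAME_1",
            "text": "Sarah Jackson",
            "best_label": "NAME",
            "labels": {"NAME": 0.9185, "NAME_GIVEN": 0.4492, "NAME_FAMILY": 0.4675},
        }
    ]


ner_objects = collect_entities(["./sample_pdfs/Letter-of-Intent-pdf.pdf"], fake_detect)
print(json.dumps(ner_objects, indent=4))
```

Passing the detection step in as a callable keeps the file loop independent of how the request is made, so the same helper could be reused with a different route or a cached result.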

## Wrap Up

Getting a list of entities from a text input or from a file follows the same pattern: read the `entities` field of the response. It's that simple 😀. See the API Reference to learn more about the other response fields, such as `processed_text` and `processed_file`.
