
# Processing PDF Files (Standard)

> This guide will get you started with PDF de-identification.

Limina supports scanning PDF files for PII and creating de-identified or redacted copies. Limina’s supported entity types work across every file type, and localized variants of **PII** (Personally Identifiable Information), **PHI** (Protected Health Information), and **PCI** (Payment Card Industry) entities are detected. The [Supported Languages](/languages) and [Supported Entity Types](/entities) pages provide a more detailed look.

<Info>
  If you'd like to try it yourself, please [sign up for an account](https://portal.getlimina.ai/) to get a free API key.
</Info>

## How PDFs Are Processed

PDFs are processed as follows:

1. First, each page in the PDF is rendered as an image. The result is similar to a PDF created by a photocopier scan. Because PDF is a complicated [format](https://opensource.adobe.com/dc-acrobat-sdk-docs/pdfstandards/PDF32000_2008.pdf), rendering each page to an image ensures that all PII is properly captured.
2. Each page in the PDF is processed as an [image](/configuration-and-operations/working-with-files/processing-files/image).
3. A new PDF is created using the redacted/de-identified images produced in the previous step.
4. If specified, an invisible, de-identified text layer is created using the OCR system output. This ensures that the resulting PDF is searchable and that its text can be copied and pasted.

<Info>
  You can configure the OCR System by setting it as an [Environment Variable](/configuration-and-operations/container-management/environment-variables) or sending it in the request object. Check out our [OCR Guide](/configuration-and-operations/working-with-files/processing-files/ocr-modes) to further understand the OCR modes and their usage.
</Info>

## Constraints

* Any attachments in a PDF file are removed.
* If the PDF document to be de-identified already has an invisible text layer, that layer is discarded and replaced with a new text layer created using OCR.

## Parameters

The parameters below control the behaviour of the PDF De-identifier. Specify them under `pdf_options`.

| Parameter        | Explanation                                                                                                                                                                                                                  | Default    |
| ---------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ---------- |
| `approach`       | This parameter changes which PDF approach is used.                                                                                                                                                                           | "standard" |
| `density`        | PDFs are converted into images using this DPI value. Smaller values result in images with smaller resolutions, which will take up less storage space and process faster, at the cost of output quality & redaction accuracy. | 200        |
| `max_resolution` | Upper bound on the longest side of each image rendered at the `density` DPI value. Any image whose longest side exceeds this value is resized down to it, while preserving aspect ratio.                                   | 3000       |

<Info>
  [PDF Approaches](/configuration-and-operations/working-with-files/processing-files/pdf-comparison) shows the differences between Standard and Enhanced PDF processing.
</Info>
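
To make the interaction between `density` and `max_resolution` concrete, here is a minimal sketch of the arithmetic for a US Letter page. The helper `rendered_size` is hypothetical and not part of the Limina API. It assumes that `max_resolution` is a pixel cap on the longest side and that sizes round to the nearest pixel, so the exact output dimensions of the service may differ.

```python
def rendered_size(width_in, height_in, density=200, max_resolution=3000):
    """Estimate the pixel size of a page rendered at `density` DPI.

    Assumptions (not confirmed by the documentation): `max_resolution`
    is a pixel cap on the longest side, and sizes round to the nearest pixel.
    """
    width_px = round(width_in * density)
    height_px = round(height_in * density)
    longest = max(width_px, height_px)
    if longest > max_resolution:
        # Scale both sides by the same factor to preserve aspect ratio.
        scale = max_resolution / longest
        width_px = round(width_px * scale)
        height_px = round(height_px * scale)
    return width_px, height_px


# US Letter (8.5 x 11 inches) at the default density stays under the cap.
letter_default = rendered_size(8.5, 11)
# At 300 DPI the longest side would be 3300 px, so the page is scaled down.
letter_300 = rendered_size(8.5, 11, density=300)

print(letter_default)  # (1700, 2200)
print(letter_300)      # (2318, 3000)
```

Under these assumptions, raising `density` above roughly 272 DPI for a Letter page yields no extra detail unless `max_resolution` is raised as well, because the resize step caps the longest side at 3000.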

## Support Matrix

|           | CPU Container | GPU Container | Community API | Professional API |
| --------- | ------------- | ------------- | ------------- | ---------------- |
| Supported | Yes           | Yes           | Up to 10 MiB  | No               |
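
Since the Community API accepts PDFs only up to 10 MiB, a client-side size check before uploading can avoid a wasted request. The sketch below is illustrative: `fits_community_limit` is a hypothetical helper, and it assumes the limit applies to the raw file size rather than to the Base64-encoded payload, which is about one third larger.

```python
# 10 MiB expressed in bytes (1 MiB = 1024 * 1024 bytes).
COMMUNITY_API_LIMIT_BYTES = 10 * 1024 * 1024


def fits_community_limit(file_bytes: bytes) -> bool:
    """Return True if the raw file is within the 10 MiB Community API limit.

    Assumption: the limit is measured on the raw file, not on the
    Base64-encoded request body.
    """
    return len(file_bytes) <= COMMUNITY_API_LIMIT_BYTES


small_ok = fits_community_limit(b"x" * 1024)
too_big_ok = fits_community_limit(b"x" * (COMMUNITY_API_LIMIT_BYTES + 1))
print(small_ok, too_big_ok)
```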

## Sample Request

<Info>
  [Connect with one of our privacy experts](https://getlimina.ai/contact-us/?utm_source=docs\&utm_medium=website) to run this code.
</Info>

<CodeGroup>
  ```json Request Body wrap lines theme={"theme":"poimandres"}
  {
    "file": {
      "data": "<file_content_base64>",
      "content_type": "application/pdf"
    },
    "entity_detection": {
      "return_entity": true
    },
    "pdf_options": {
      "approach":"standard"
    }
  }
  ```

  ```shell curl wrap lines theme={"theme":"poimandres"}
  echo '{
            "file": {"data": "'$(base64 -w 0 sample.pdf)'",
            "content_type": "application/pdf"},
            "entity_detection": {"return_entity": true},
            "pdf_options": {"approach": "standard"}
        }' \
  | curl --request POST --url 'https://api.private-ai.com/community/v4/process/files/base64' \
         -H 'Content-Type: application/json' \
         -H 'x-api-key: <YOUR KEY HERE>' \
         -d @- \
         | jq -r .processed_file \
         | base64 -d > 'sample.redacted.pdf'
  ```

  ```python python wrap lines theme={"theme":"poimandres"}
  import requests
  import base64

  file_url = "https://paidocumentation.blob.core.windows.net/$web/sample.pdf"
  filename_out = "/path/to/output/sample.redacted.pdf"
  file_content = requests.get(file_url).content
  file_content_base64 = base64.b64encode(file_content).decode()

  url = "https://api.private-ai.com/community/v4/process/files/base64"

  headers = {"Content-Type": "application/json", "x-api-key": "<INSERT API KEY>"}

  payload = {
    "file":{
      "data": file_content_base64,
      "content_type": "application/pdf",
    },
    "entity_detection": {
      "return_entity": True
    },
    "pdf_options": {
      "approach": "standard"
    }
  }

  response = requests.post(url, json=payload, headers=headers)
  response.raise_for_status()  # fail early on an HTTP error instead of a KeyError below
  with open(filename_out, "wb") as f:
      f.write(base64.b64decode(response.json()["processed_file"]))
  ```

  ```python Python Client wrap lines theme={"theme":"poimandres"}
  from privateai_client import PAIClient
  from privateai_client.objects import request_objects
  import base64

  filename_in = "sample.pdf"
  filename_out = "sample.redacted.pdf"

  file_type = "application/pdf"
  client = PAIClient(url="https://api.private-ai.com/community/v4/", api_key="<YOUR API KEY>")

  with open(filename_in, "rb") as b64_file:
      file_data = base64.b64encode(b64_file.read())
      file_data = file_data.decode("ascii")

  file_obj = request_objects.file_obj(data=file_data, content_type=file_type)
  request_obj = request_objects.file_base64_obj(file=file_obj)
  resp = client.process_files_base64(request_object=request_obj)

  with open(filename_out, 'wb') as redacted_file:
      processed_file = resp.processed_file.encode("ascii")
      processed_file = base64.b64decode(processed_file, validate=True)
      redacted_file.write(processed_file)
  ```
</CodeGroup>

## Sample Response

```json Response wrap lines theme={"theme":"poimandres"}
{
  "processed_file": "Base64 Encoded File Content of the Redacted File",
  "processed_text": "string",
  "entities": "List[Entity]",
  "entities_present": true,
  "languages_detected": {"lang_1": 0.67, "lang_2": 0.74}
}
```
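
The response fields above can be consumed along the following lines. This is a minimal sketch that uses a stand-in dictionary instead of a live API call. The helper `save_processed_file` and the placeholder file content are hypothetical, and only the `processed_file` and `entities_present` keys shown in the sample response are relied on.

```python
import base64
import os
import tempfile


def save_processed_file(response_body: dict, path: str) -> int:
    """Decode `processed_file` from Base64, write it to `path`, return bytes written."""
    pdf_bytes = base64.b64decode(response_body["processed_file"], validate=True)
    with open(path, "wb") as f:
        f.write(pdf_bytes)
    return len(pdf_bytes)


# Stand-in for response.json(); the file content is a placeholder, not a real PDF.
sample_response = {
    "processed_file": base64.b64encode(b"%PDF-1.7 placeholder").decode("ascii"),
    "entities_present": True,
}

out_path = os.path.join(tempfile.mkdtemp(), "sample.redacted.pdf")
written = save_processed_file(sample_response, out_path)

if sample_response["entities_present"]:
    print(f"Entities were detected; wrote {written} bytes to {out_path}")
```

Passing `validate=True` makes the decoder raise an error on characters outside the Base64 alphabet instead of silently skipping them, which helps surface a truncated or corrupted response.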
