If you’d like to try it yourself, please visit our free interactive web demo. No code or account is necessary.
How PDFs Are Processed
PDFs are processed as follows:
- First, each page in the PDF is rendered as an image. The result is similar to a PDF created by a photocopier scan. This is done to ensure that all PII is properly captured, because PDF is a complicated format.
- Each page image is then processed to redact or de-identify any PII it contains.
- A new PDF is created using the redacted/de-identified images produced in the previous step.
- If specified, an invisible, de-identified text layer is created using the OCR system output. This ensures that the resulting PDF is searchable and that its text can be copied and pasted.
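The ordering of the steps above can be sketched in Python. This is only an illustration of the flow, not the product's implementation: `deidentify_pdf`, `RedactedPage`, and the three callables are hypothetical names, and how the de-identified OCR text is actually produced is not specified here.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class RedactedPage:
    image: bytes           # redacted/de-identified page image
    text: Optional[str]    # de-identified text for the invisible layer, if requested


def deidentify_pdf(
    pdf_bytes: bytes,
    render_pages: Callable[[bytes], List[bytes]],   # hypothetical: PDF -> page images
    redact_image: Callable[[bytes], bytes],         # hypothetical: image -> redacted image
    deidentified_text: Callable[[bytes], str],      # hypothetical: image -> de-identified OCR text
    add_text_layer: bool = False,
) -> List[RedactedPage]:
    pages = []
    # Step 1: render every page as an image.
    for image in render_pages(pdf_bytes):
        # Step 2: process the page as an image.
        redacted = redact_image(image)
        # Step 4 (optional): prepare text for the invisible text layer.
        text = deidentified_text(image) if add_text_layer else None
        pages.append(RedactedPage(redacted, text))
    # Step 3: a real implementation would assemble these pages into a new PDF.
    return pages
```

Passing the rendering, redaction, and OCR steps in as callables keeps the sketch runnable without assuming any particular PDF or OCR library.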
You can configure the OCR system by setting it as an environment variable or by sending it in the request object. Check out our OCR Guide to learn more about the OCR modes and their usage.
Constraints
- Any attachments in a PDF file are removed.
- If the PDF document to be de-identified already has an invisible text layer, that layer is discarded and replaced with a new text layer created using OCR.
Parameters
Below are the parameters that control the behaviour of the PDF De-identifier. These parameters are specified under pdf_options.
| Parameter | Explanation | Default |
|---|---|---|
| density | PDFs are converted into images using this DPI value. Smaller values produce lower-resolution images, which take up less storage space and process faster, at the cost of output quality and redaction accuracy. | 200 |
| max_resolution | PDFs are converted into images using the density DPI value. Any resulting image whose longest side exceeds this value is resized down to it, preserving the aspect ratio. | 3000 |
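To make the interaction between the two parameters concrete, here is a minimal Python sketch that estimates the rendered size of a page. It assumes that max_resolution is measured in pixels and that sizes are rounded to the nearest pixel; neither detail is stated above, and `rendered_size` is a hypothetical helper, not part of the product.

```python
def rendered_size(width_in, height_in, density=200, max_resolution=3000):
    """Estimate the pixel size of a page rendered at `density` DPI,
    then capped so its longest side does not exceed `max_resolution`.

    Assumes max_resolution is in pixels and sizes round to the nearest pixel.
    """
    width_px = round(width_in * density)
    height_px = round(height_in * density)
    longest = max(width_px, height_px)
    if longest > max_resolution:
        # Scale both sides by the same factor to preserve the aspect ratio.
        scale = max_resolution / longest
        width_px = round(width_px * scale)
        height_px = round(height_px * scale)
    return width_px, height_px


# The parameters are specified under pdf_options; the rest of the
# request body is omitted here.
request_fragment = {"pdf_options": {"density": 200, "max_resolution": 3000}}

print(rendered_size(8.5, 11))               # US Letter at the defaults
print(rendered_size(8.5, 11, density=300))  # exceeds the cap, so it is resized
```

Under these assumptions, a US Letter page at the default density of 200 stays below the default cap, while the same page at a density of 300 is scaled down so that its longest side equals 3000.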
Support Matrix
| | CPU Container | GPU Container | Community API | Professional API |
|---|---|---|---|---|
| Supported | Yes | Yes | Up to 10 MiB | No |
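A client can check a file against the Community API size limit before sending it. The sketch below assumes the 10 MiB limit applies to the raw file size and is inclusive; the table does not say whether it is measured before or after any request encoding. The function names are hypothetical.

```python
import os

# 10 MiB, using binary units (1 MiB = 1024 * 1024 bytes).
COMMUNITY_API_LIMIT_BYTES = 10 * 1024 * 1024


def within_community_limit(size_bytes: int) -> bool:
    """Return True if a file of this size fits the assumed 10 MiB limit."""
    return size_bytes <= COMMUNITY_API_LIMIT_BYTES


def file_within_community_limit(path: str) -> bool:
    """Check a file on disk against the assumed limit."""
    return within_community_limit(os.path.getsize(path))
```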
Sample Request
Connect with one of our privacy experts to run this code.
Sample Response