# Processing Word (DOC/DOCX) Files

> This guide will get you started with DOC and DOCX de-identification.

Limina supports scanning Microsoft Word DOC and DOCX files for PII and creating de-identified or redacted copies. Limina’s supported entity types work the same way across every file type, so localized variants of **PII** (Personally Identifiable Information), **PHI** (Protected Health Information), and **PCI** (Payment Card Industry) entities are detected in Word files too. Our [Supported Languages](/languages) and [Supported Entity Types](/entities) pages provide a more detailed look.

<Info>
  If you'd like to try it yourself, please [sign up for an account](https://portal.getlimina.ai/) to get a free API key.
</Info>

## How DOCX Files Are Processed

<Warning>
  Word document support is a new feature. Depending on the complexity of a document, some of its elements might not be properly de-identified. We are working on expanding support. In the meantime, consider rendering the document as a [PDF](/configuration-and-operations/working-with-files/processing-files/pdf) and processing that instead. This will ensure all content is processed and redacted.
</Warning>

DOCX files are processed by extracting each element and handling it according to the table below. The de-identified or redacted file is then created using the behaviour specified for each element type.

| Property Type          | Details                                                                                                                                      | Default Behaviour                           | Options                                    |
| ---------------------- | -------------------------------------------------------------------------------------------------------------------------------------------- | ------------------------------------------- | ------------------------------------------ |
| Core properties        | Author, Category, Comments, Content Status, Identifier, Keywords, Language, Last Modified By, Subject, Title, Version                        | Redact                                      | Keep, Redact                               |
| Headers and footers    | Any content in headers and footers, such as text and images                                                                                  | Redact                                      | Keep, Redact                               |
| Tables                 | Table objects with text and images                                                                                                           | Redact                                      | Keep, Redact                               |
| Images                 | The [Images](/configuration-and-operations/working-with-files/processing-files/image) page provides a more detailed look at Image processing | Redact, unsupported image types are removed | Redact                                     |
| Text content           | Main body content                                                                                                                            | Redact                                      | Keep, Redact                               |
| Text boxes             | Floating text boxes                                                                                                                          | Redact                                      | Keep, Redact                               |
| Embedded links         | Hyperlinks to internet pages or documents                                                                                                    | Remove                                      | Keep, Redact                               |
| External elements      | Tables and charts embedded from another document or file, such as an Excel chart                                                             | Remove external file, redact cached values  | Remove external file, redact cached values |
| Embedded audio & video | Videos and audio clips                                                                                                                       | Remove                                      | Remove                                     |
| Review comments        | Comments from document reviews                                                                                                               | Redact                                      | Keep, Redact                               |
| Shape objects          | Shapes containing text                                                                                                                       | Redact                                      | Keep, Redact                               |
| Ink Drawings           | Drawings in DOCX documents                                                                                                                   | Remove                                      | Keep, Remove                               |

<Info>
  See the [API Reference](/latest/process-files-base64) for changing the default behaviour.
</Info>

<Info>
  Images that contain text are run through OCR and then redacted. You can configure the OCR system by setting it as an [Environment Variable](/configuration-and-operations/container-management/environment-variables) or by sending it in the request object. Check out our [OCR Guide](/configuration-and-operations/working-with-files/processing-files/ocr-modes) to further understand the OCR modes and their usage.
</Info>

## How DOC Files Are Processed

DOC files are converted to DOCX, processed as described above, and then converted back to DOC.
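
Because DOC and DOCX files go through the same endpoint, the main difference on the client side is the `content_type` sent with the file. The sketch below picks a MIME type from the file extension and builds a request body in the shape used in the Sample Request section. The helper name `build_word_payload` is hypothetical, and the use of `application/msword` for DOC input is an assumption based on the standard MIME type for legacy Word files; check the API Reference for the values the endpoint actually accepts.

```python
import base64
from pathlib import Path

# Standard MIME types for Word formats. That the endpoint expects exactly
# "application/msword" for DOC input is an assumption, not a documented fact.
WORD_CONTENT_TYPES = {
    ".docx": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
    ".doc": "application/msword",
}


def build_word_payload(path):
    """Build a request body for a DOC or DOCX file (hypothetical helper)."""
    suffix = Path(path).suffix.lower()
    if suffix not in WORD_CONTENT_TYPES:
        raise ValueError(f"Not a Word file: {path}")

    # The file content is sent as base64 text, as in the sample requests.
    data = base64.b64encode(Path(path).read_bytes()).decode("ascii")
    return {
        "file": {"data": data, "content_type": WORD_CONTENT_TYPES[suffix]},
        "entity_detection": {"return_entity": True},
    }
```

The returned dictionary can be passed as the `json` argument of `requests.post`, in the same way as the `payload` in the Python sample request.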

## Constraints

* If a piece of PII text has more than one style (different fonts, font sizes, underline etc.), the redaction marker will use the first style.
* Charts in DOCX files contain an underlying XLSX document that is automatically removed during de-identification. The cached chart values are de-identified by default.
* Certain MHT files can be natively opened with Microsoft Word by changing the extension from MHT to DOC. Those files are not supported. We recommend that you use Microsoft Word to convert those files to DOCX.
* We recommend using Microsoft Word to open the processed DOC/DOCX files. Other editors may not give ideal results.
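
One way to catch renamed MHT files before sending them is to look at the first bytes of the file instead of trusting the extension. The sketch below is a heuristic, not part of the Limina API: the function names are hypothetical, and the check relies on the facts that binary DOC files use the OLE compound file signature, DOCX files are ZIP archives, and MHT files are MIME text that normally contains a `MIME-Version:` header near the top. Other Office formats share the same OLE and ZIP signatures, so a match only shows that the container type is plausible.

```python
OLE_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")  # legacy binary DOC container
ZIP_MAGIC = b"PK\x03\x04"                      # DOCX is a ZIP archive


def sniff_word_format(head):
    """Classify a file from its leading bytes (heuristic, hypothetical helper)."""
    if head.startswith(OLE_MAGIC):
        return "ole"
    if head.startswith(ZIP_MAGIC):
        return "zip"
    # MHT files are MIME text; look for the header in the sniffed bytes.
    if b"mime-version:" in head[:1024].lower():
        return "mht"
    return "unknown"


def sniff_file(path):
    """Read the first KiB of a file and classify it."""
    with open(path, "rb") as f:
        return sniff_word_format(f.read(1024))
```

A file named `report.doc` for which `sniff_file` returns `"mht"` is likely one of the unsupported renamed MHT files and should be converted to DOCX in Microsoft Word first.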

## Support Matrix

|           | CPU Container | GPU Container | Community API | Professional API |
| --------- | ------------- | ------------- | ------------- | ---------------- |
| Supported | Yes           | Yes           | Up to 10 MiB  | No               |
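
When using the Community API, it can help to check the file size before uploading. The sketch below assumes the 10 MiB limit applies to the raw file; the matrix above does not say whether the limit is measured before or after base64 encoding, so the second helper also estimates the encoded size, which is roughly a third larger. Both function names are hypothetical.

```python
import os

# 10 MiB, taken from the support matrix above.
COMMUNITY_LIMIT_BYTES = 10 * 1024 * 1024


def base64_length(size_bytes):
    """Length in characters of the padded base64 encoding of size_bytes bytes."""
    return 4 * ((size_bytes + 2) // 3)


def fits_community_limit(path, limit=COMMUNITY_LIMIT_BYTES):
    """Return True if the raw file is within the limit (assumed to be raw size)."""
    return os.path.getsize(path) <= limit
```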

## Sample Request

<Info>
  [Connect with one of our privacy experts](https://getlimina.ai/en/contact-us/?utm_source=docs\&utm_medium=website) to run this code.
</Info>

<CodeGroup>
  ```json Request Body wrap lines theme={"theme":"poimandres"}
  {
    "file": {
      "data": "<file_content_base64>",
      "content_type": "application/vnd.openxmlformats-officedocument.wordprocessingml.document"
    },
    "entity_detection": {
      "return_entity": true
    }
  }
  ```

  ```shell curl wrap lines theme={"theme":"poimandres"}
  echo '{
            "file": {"data": "'$(base64 -w 0 sample.docx)'",
            "content_type": "application/vnd.openxmlformats-officedocument.wordprocessingml.document"},
            "entity_detection": {"return_entity": true}
        }' \
  | curl --request POST --url 'https://api.private-ai.com/community/v4/process/files/base64' \
         -H 'Content-Type: application/json' \
         -H 'x-api-key: <YOUR KEY HERE>' \
         -d @- \
         | jq -r .processed_file \
         | base64 -d > 'sample.redacted.docx'
  ```

  ```python python wrap lines theme={"theme":"poimandres"}
  import requests
  import base64

  file_url = "https://paidocumentation.blob.core.windows.net/$web/sample.docx"
  filename_out = "/path/to/output/sample.redacted.docx"
  file_content = requests.get(file_url).content
  file_content_base64 = base64.b64encode(file_content).decode("ascii")

  url = "https://api.private-ai.com/community/v4/process/files/base64"

  headers = {"Content-Type": "application/json", "x-api-key": "<INSERT API KEY>"}

  payload = {
    "file":{
      "data": file_content_base64,
      "content_type": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
    },
    "entity_detection": {
      "return_entity": True
    }
  }

  response = requests.post(url, json=payload, headers=headers)
  # Fail early on HTTP errors instead of trying to decode an error body
  response.raise_for_status()
  with open(filename_out, "wb") as f:
      f.write(base64.b64decode(response.json()["processed_file"]))
  ```

  ```python Python Client wrap lines theme={"theme":"poimandres"}
  from privateai_client import PAIClient
  from privateai_client.objects import request_objects
  import base64

  filename_in = "sample.docx"
  filename_out = "sample.redacted.docx"

  file_type= "application/vnd.openxmlformats-officedocument.wordprocessingml.document"
  client = PAIClient(url="https://api.private-ai.com/community/v4/", api_key="<YOUR API KEY>")

  with open(filename_in, "rb") as b64_file:
      file_data = base64.b64encode(b64_file.read())
      file_data = file_data.decode("ascii")

  file_obj = request_objects.file_obj(data=file_data, content_type=file_type)
  request_obj = request_objects.file_base64_obj(file=file_obj)
  resp = client.process_files_base64(request_object=request_obj)

  with open(filename_out, 'wb') as redacted_file:
      processed_file = resp.processed_file.encode("ascii")
      processed_file = base64.b64decode(processed_file, validate=True)
      redacted_file.write(processed_file)
  ```
</CodeGroup>

## Sample Response

```json Response wrap lines theme={"theme":"poimandres"}
{
  "processed_file": "Base64 Encoded File Content of the Redacted File",
  "processed_text": "string",
  "entities": "List[Entity]",
  "entities_present": true,
  "languages_detected": {"lang_1": 0.67, "lang_2": 0.74}
}
```
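
A response in this shape can be handled with a few lines of Python. The sketch below assumes the body has already been parsed into a dictionary (for example with `response.json()`), decodes `processed_file`, and writes the redacted document to disk. The helper name `save_processed_file` is hypothetical, and the structure of the individual items in `entities` is not assumed; they are only counted.

```python
import base64


def save_processed_file(response_body, out_path):
    """Write the redacted file to out_path; return (bytes_written, entity_count)."""
    # validate=True rejects characters outside the base64 alphabet instead
    # of silently skipping them.
    raw = base64.b64decode(response_body["processed_file"], validate=True)
    with open(out_path, "wb") as f:
        f.write(raw)

    # "entities" is treated as an opaque list here.
    entity_count = len(response_body.get("entities", []))
    return len(raw), entity_count
```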
