Processing Word (DOC/DOCX) Files

Limina supports scanning Microsoft Word DOC and DOCX files for PII and creating de-identified or redacted copies. Limina’s supported entity types work across both file types, and localized variants of PII (Personally Identifiable Information), PHI (Protected Health Information), and PCI (Payment Card Industry) entities are detected. Our Supported Languages and Supported Entity Types pages provide a more detailed look.

If you’d like to try it yourself, please sign up for an account to get a free API key.

How DOCX Files Are Processed

Word document support is a new feature. Depending on the complexity of a document, some of its elements might not be properly de-identified. We are working on expanding support; in the meantime, if a document must be fully covered, consider rendering it as a PDF and processing the PDF instead. This will ensure all content is processed and redacted.
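As one possible way to render a Word document to PDF before processing, the sketch below shells out to LibreOffice in headless mode. LibreOffice is an assumption here and is not something this page prescribes; its rendering can differ from Microsoft Word's, so exporting to PDF from Word itself may be preferable when it is available. The function names `build_pdf_command` and `render_to_pdf` are hypothetical.

```python
import shutil
import subprocess
from pathlib import Path


def build_pdf_command(doc_path: str, out_dir: str, soffice: str = "soffice") -> list[str]:
    """Build a LibreOffice headless command that renders a document to PDF."""
    return [soffice, "--headless", "--convert-to", "pdf", "--outdir", out_dir, doc_path]


def render_to_pdf(doc_path: str, out_dir: str) -> Path:
    """Run the conversion and return the path where the PDF is expected."""
    # Assumption: LibreOffice is installed and `soffice` is on PATH.
    if shutil.which("soffice") is None:
        raise RuntimeError("LibreOffice (soffice) was not found on PATH")
    subprocess.run(build_pdf_command(doc_path, out_dir), check=True)
    # LibreOffice names the output after the input file's stem.
    return Path(out_dir) / (Path(doc_path).stem + ".pdf")
```

The resulting PDF can then be submitted in place of the DOCX file.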

DOCX files are processed by extracting each element and handling it according to the table below. The de-identified or redacted file is created using the behaviour specified for each element.

Property Type	Details	Default Behaviour	Options
Core properties	Author, Category, Comments, Content Status, Identifier, Keywords, Language, Last Modified By, Subject, Title, Version	Redact	Keep, Redact
Headers and footers	Any content in headers and footers, such as text and images	Redact	Keep, Redact
Tables	Table objects with text and images	Redact	Keep, Redact
Images	The Images page provides a more detailed look at image processing	Redact; unsupported image types are removed	Redact
Text content	Main body content	Redact	Keep, Redact
Text boxes	Floating text boxes	Redact	Keep, Redact
Embedded links	Hyperlinks to internet pages or documents	Remove	Keep, Redact
External elements	Tables and charts embedded from another document or file, such as an Excel chart	Remove external file, redact cached values	Remove external file, redact cached values
Embedded audio & video	Videos and audio clips	Remove	Remove
Review comments	Comments from document reviews	Redact	Keep, Redact
Shape objects	Shapes containing text	Redact	Keep, Redact
Ink Drawings	Drawings in DOCX documents	Remove	Keep, Remove

See the API Reference for changing the default behaviour.

Graphical content (images) where text is present will be OCRed and then redacted. You can configure the OCR System by setting it as an Environment Variable or sending it in the request object. Check out our OCR Guide to further understand the OCR modes and their usage.

How DOC Files Are Processed

DOC files are processed by converting them to DOCX, following the process described above, and then converting the result back to DOC.
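Because both formats go through the same pipeline, a client mostly needs to send the right `content_type` for each file. The sketch below picks one from the file extension and builds the same payload shape used in the sample request further down. Using `application/msword` for DOC files is an assumption (it is the standard MIME type for DOC, but this page only shows the DOCX value), and `build_payload` is a hypothetical helper.

```python
import base64
from pathlib import Path

CONTENT_TYPES = {
    # Taken from the sample request on this page.
    ".docx": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
    # Assumption: the standard MIME type for legacy DOC files is accepted.
    ".doc": "application/msword",
}


def build_payload(filename: str, file_bytes: bytes) -> dict:
    """Build a base64 file-processing payload for a DOC or DOCX file."""
    suffix = Path(filename).suffix.lower()
    if suffix not in CONTENT_TYPES:
        raise ValueError(f"Unsupported extension: {suffix!r}")
    return {
        "file": {
            "data": base64.b64encode(file_bytes).decode("ascii"),
            "content_type": CONTENT_TYPES[suffix],
        },
        "entity_detection": {"return_entity": True},
    }
```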

Constraints

If a piece of PII text has more than one style (different fonts, font sizes, underline, etc.), the redaction marker will use the first style.
Charts in DOCX files contain an underlying .XLSX document that is automatically removed during de-identification. The cached chart values are de-identified by default.
Certain MHT files can be natively opened with Microsoft Word by changing the extension from MHT to DOC. Those files are not supported. We recommend that you use Microsoft Word to convert those files to DOCX.
We recommend using Microsoft Word to open the processed DOC/DOCX files. Other editors may not give ideal results.
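To catch MHT files that were renamed to DOC before they are uploaded, a client could inspect the first bytes of the file. The sketch below is a heuristic, not part of the Limina API: genuine DOC files are OLE2 compound files with a well-known 8-byte signature, DOCX files are ZIP containers, and MHT files are MIME text that usually carries a `MIME-Version:` header near the top. The function name `sniff_word_format` is hypothetical.

```python
OLE2_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")  # OLE2 compound file (binary DOC)
ZIP_MAGIC = b"PK\x03\x04"                       # ZIP container (DOCX is a ZIP)


def sniff_word_format(file_bytes: bytes) -> str:
    """Guess the real container format from the leading bytes."""
    if file_bytes.startswith(OLE2_MAGIC):
        return "doc"
    if file_bytes.startswith(ZIP_MAGIC):
        # Any ZIP matches here; this does not prove the file is a valid DOCX.
        return "zip"
    # Heuristic: MHT is MIME text, so look for a MIME header near the top.
    head = file_bytes[:1024].lower()
    if b"mime-version:" in head:
        return "mht"
    return "unknown"
```

A file with a `.doc` extension that comes back as `"mht"` is a candidate for conversion to DOCX in Microsoft Word before processing.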

Support Matrix

	CPU Container	GPU Container	Community API	Professional API
Supported	Yes	Yes	Up to 10 MiB	No
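A client can check a file against the 10 MiB Community API limit before sending it. This page does not say whether the limit applies to the raw file or to the base64-encoded payload, so the sketch below exposes both sizes; both helper names are hypothetical.

```python
COMMUNITY_LIMIT_BYTES = 10 * 1024 * 1024  # 10 MiB, from the support matrix


def fits_community_limit(size_bytes: int, limit: int = COMMUNITY_LIMIT_BYTES) -> bool:
    """Return True if a size in bytes is within the limit."""
    return size_bytes <= limit


def base64_length(size_bytes: int) -> int:
    """Length of the padded base64 encoding of `size_bytes` raw bytes."""
    return 4 * ((size_bytes + 2) // 3)
```

Base64 inflates data by roughly a third, so checking `fits_community_limit(base64_length(n))` is the conservative choice when it is unclear which size is measured.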

Sample Request

Connect with one of our privacy experts to run this code.

{
  "file": {
    "data": "<file_content_base64>",
    "content_type": "application/vnd.openxmlformats-officedocument.wordprocessingml.document"
  },
  "entity_detection": {
    "return_entity": true
  }
}

echo '{
          "file": {"data": "'$(base64 -w 0 sample.docx)'", 
          "content_type": "application/vnd.openxmlformats-officedocument.wordprocessingml.document"}, 
          "entity_detection": {"return_entity": true}
      }' \
| curl --request POST --url 'https://api.getlimina.ai/community/v4/process/files/base64' \
       -H 'Content-Type: application/json' \
       -H 'x-api-key: <YOUR KEY HERE>' \
       -d @- \
       | jq -r .processed_file \
       | base64 -d > 'sample.redacted.docx'

import requests
import base64

file_url = "https://paidocumentation.blob.core.windows.net/$web/sample.docx"
filename_out = "/path/to/output/sample.redacted.docx"
file_content = requests.get(file_url).content
file_content_base64 = base64.b64encode(file_content).decode("ascii")

url = "https://api.getlimina.ai/community/v4/process/files/base64"

headers = {"Content-Type": "application/json", "x-api-key": "<INSERT API KEY>"}

payload = {
  "file":{
    "data": file_content_base64,
    "content_type": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
  },
  "entity_detection": {
    "return_entity": True
  }
}

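response = requests.post(url, json=payload, headers=headers)
# Fail early on HTTP errors instead of trying to decode an error body.
response.raise_for_status()

# Decode the base64-encoded redacted file and write it to disk.
with open(filename_out, "wb") as f:
    f.write(base64.b64decode(response.json()["processed_file"]))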

from privateai_client import PAIClient
from privateai_client.objects import request_objects
import base64

filename_in = "sample.docx"
filename_out = "sample.redacted.docx"

file_type= "application/vnd.openxmlformats-officedocument.wordprocessingml.document"
client = PAIClient(url="https://api.getlimina.ai/community/v4/", api_key="<YOUR API KEY>")

with open(filename_in, "rb") as b64_file:
    file_data = base64.b64encode(b64_file.read())
    file_data = file_data.decode("ascii")

file_obj = request_objects.file_obj(data=file_data, content_type=file_type)
request_obj = request_objects.file_base64_obj(file=file_obj)
resp = client.process_files_base64(request_object=request_obj)

with open(filename_out, 'wb') as redacted_file:
    processed_file = resp.processed_file.encode("ascii")
    processed_file = base64.b64decode(processed_file, validate=True)
    redacted_file.write(processed_file)

Sample Response

{
  "processed_file": "Base64 Encoded File Content of the Redacted File",
  "processed_text": "string",
  "entities": "List[Entity]",
  "entities_present": true,
  "languages_detected": {"lang_1": 0.67, "lang_2": 0.74}
}
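A minimal sketch of consuming a response of this shape: it decodes the redacted file and summarizes the other fields. It assumes only the top-level keys shown above; the structure of each entity is not described on this page, so entities are only counted. `summarize_response` is a hypothetical helper.

```python
import base64


def summarize_response(body: dict) -> dict:
    """Decode the redacted file and summarize the remaining fields."""
    redacted_bytes = base64.b64decode(body["processed_file"], validate=True)
    languages = body.get("languages_detected") or {}
    # Pick the language with the highest score, if any were reported.
    top_language = max(languages, key=languages.get) if languages else None
    entities = body.get("entities") or []
    return {
        "redacted_bytes": redacted_bytes,
        "entities_present": bool(body.get("entities_present")),
        "entity_count": len(entities),
        "top_language": top_language,
    }
```

The `redacted_bytes` value can be written directly to a `.docx` or `.doc` file, as in the sample requests above.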
