Customizing Detection

Each use case is different and you may sometimes need to adjust the types of entities that Limina detects to address your specific requirements. This guide introduces a few techniques to modify and extend the detection engine to get the most from your data and is organized into two parts:

Part 1: Enabling and Disabling Entity Types covers enabling and disabling the types of entities that are detected.
Part 2: Filters covers allow and block lists, including regular expression patterns.

The techniques described in the following sections apply to most of the Limina APIs. In particular, they can be used to customize detection in the NER route, the Process Text route, the File URI route and the File Base64 route. See the documentation for each route for details. For simplicity, the examples below use requests and responses from the Process Text route.

Enabling and Disabling Entity Types

Limina detects over 50 unique entity types spanning personal, credit card and medical information. By default, all non-beta supported types are detected, but this can easily be customized via entity selectors. If you need to comply with existing legislation such as GDPR or HIPAA, you may want to de-identify only the entities covered by that regulation; this can be done with preset entity groups. Alternatively, you may prefer to detect your own set of entities, which can also be done using entity selectors.

An example Python script showing how to use entity selectors with Limina’s Python client can be found here.

Configuring Entity Selectors

Entity selectors let you enable or disable entity types as part of your API request. You can, for example, enable NAME and ORGANIZATION while ignoring all other entity types by using the ENABLE selector as shown in this request:

Request Body

{
  "text": [
    "Hi! This is Icarus Airways Customer Service. How may I be of assistance to you? Hello! I’d like to complain about a discrepancy in my Icarus Frequent Voyager points. I checked my account today. I had 5,000 points in 2019. Now it’s just 3,500. I haven’t flown in Icarus since 2019 because of COVID. What happened? May I have your name and account number, ma’am? Nessa Jonsson, N-E-S-S-A J-O-N-S-S-O-N."
  ],
  "entity_detection": {
    "entity_types": [
      {
        "type": "ENABLE",
        "value": ["ORGANIZATION", "NAME"]
      }
    ]
  }
}

As expected, the name and organization mentions are redacted while other entities like dates (e.g. 2019) and conditions (e.g. COVID) are left untouched in the de-identified text:

"Hi! This is [ORGANIZATION_1]. How may I be of assistance to you? Hello! I’d like to complain about a discrepancy in my [ORGANIZATION_2] Frequent Voyager points. I checked my account today. I had 5,000 points in 2019. Now it’s just 3,500. I haven’t flown in [ORGANIZATION_2] since 2019 because of COVID. What happened? May I have your name and account number, ma’am? [NAME_1], [NAME_1]."

[
  {
    "processed_text": "Hi! This is [ORGANIZATION_1]. How may I be of assistance to you? Hello! I’d like to complain about a discrepancy in my [ORGANIZATION_2] Frequent Voyager points. I checked my account today. I had 5,000 points in 2019. Now it’s just 3,500. I haven’t flown in [ORGANIZATION_2] since 2019 because of COVID. What happened? May I have your name and account number, ma’am? [NAME_1], [NAME_1].",
    "entities": [
      {
        "processed_text": "ORGANIZATION_1",
        "text": "Icarus Airways Customer Service",
        "location": {
          "stt_idx": 12,
          "end_idx": 43,
          "stt_idx_processed": 12,
          "end_idx_processed": 28
        },
        "best_label": "ORGANIZATION",
        "labels": {
          "ORGANIZATION": 0.8451
        }
      },
      {
        "processed_text": "ORGANIZATION_2",
        "text": "Icarus",
        "location": {
          "stt_idx": 134,
          "end_idx": 140,
          "stt_idx_processed": 119,
          "end_idx_processed": 135
        },
        "best_label": "ORGANIZATION",
        "labels": {
          "ORGANIZATION": 0.6382
        }
      },
      {
        "processed_text": "ORGANIZATION_2",
        "text": "Icarus",
        "location": {
          "stt_idx": 262,
          "end_idx": 268,
          "stt_idx_processed": 257,
          "end_idx_processed": 273
        },
        "best_label": "ORGANIZATION",
        "labels": {
          "ORGANIZATION": 0.6274
        }
      },
      {
        "processed_text": "NAME_1",
        "text": "Nessa Jonsson",
        "location": {
          "stt_idx": 361,
          "end_idx": 374,
          "stt_idx_processed": 366,
          "end_idx_processed": 374
        },
        "best_label": "NAME",
        "labels": {
          "NAME": 0.8899
        }
      },
      {
        "processed_text": "NAME_1",
        "text": "N-E-S-S-A J-O-N-S-S-O-N",
        "location": {
          "stt_idx": 376,
          "end_idx": 399,
          "stt_idx_processed": 376,
          "end_idx_processed": 384
        },
        "best_label": "NAME",
        "labels": {
          "NAME": 0.8863
        }
      }
    ],
    "entities_present": true,
    "characters_processed": 400,
    "languages_detected": {
      "en": 0.9167966246604919
    }
  }
]
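
The location offsets in each entity make it possible to relate the response back to the input. As a rough illustration, the Python sketch below rebuilds a redacted string from the original text and the entities list by replacing each span from stt_idx to end_idx with its bracketed marker. The function name is hypothetical, and the sketch assumes that spans do not overlap and that the offsets are character positions in the input string with an exclusive end, as the sample response suggests.

```python
def apply_redactions(text, entities):
    """Replace each detected span in `text` with its bracketed marker.

    Assumes spans do not overlap and that stt_idx/end_idx are character
    offsets into `text` (end exclusive), as the sample response suggests.
    """
    pieces = []
    cursor = 0
    for entity in sorted(entities, key=lambda e: e["location"]["stt_idx"]):
        start = entity["location"]["stt_idx"]
        end = entity["location"]["end_idx"]
        pieces.append(text[cursor:start])          # untouched text before the span
        pieces.append("[" + entity["processed_text"] + "]")
        cursor = end
    pieces.append(text[cursor:])                   # untouched tail
    return "".join(pieces)


sample_text = "Hi! This is Icarus Airways Customer Service. How may I be of assistance to you?"
sample_entities = [
    {
        "processed_text": "ORGANIZATION_1",
        "text": "Icarus Airways Customer Service",
        "location": {"stt_idx": 12, "end_idx": 43},
    }
]
redacted = apply_redactions(sample_text, sample_entities)
print(redacted)
# Hi! This is [ORGANIZATION_1]. How may I be of assistance to you?
```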

It is sometimes simpler to specify the entities to disable. This can be done using a DISABLE selector:

Request Body

{
  "text": [
    "Hi! This is Icarus Airways Customer Service. How may I be of assistance to you? Hello! I’d like to complain about a discrepancy in my Icarus Frequent Voyager points. I checked my account today. I had 5,000 points in 2019. Now it’s just 3,500. I haven’t flown in Icarus since 2019 because of COVID. What happened? May I have your name and account number, ma’am? Nessa Jonsson, N-E-S-S-A J-O-N-S-S-O-N."
  ],
  "entity_detection": {
    "entity_types": [
      {
        "type": "DISABLE",
        "value": ["DATE", "DATE_INTERVAL"]
      }
    ]
  }
}

The above request will redact all entity types except dates and date intervals as shown by the corresponding output:

"Hi! This is [ORGANIZATION_1]. How may I be of assistance to you? Hello! I’d like to complain about a discrepancy in my [ORGANIZATION_2] Frequent Voyager points. I checked my account today. I had 5,000 points in 2019. Now it’s just 3,500. I haven’t flown in [ORGANIZATION_2] since 2019 because of [CONDITION_1]. What happened? May I have your name and account number, ma’am? [NAME_1], [NAME_1]."

[
  {
    "processed_text": "Hi! This is [ORGANIZATION_1]. How may I be of assistance to you? Hello! I’d like to complain about a discrepancy in my [ORGANIZATION_2] Frequent Voyager points. I checked my account today. I had 5,000 points in 2019. Now it’s just 3,500. I haven’t flown in [ORGANIZATION_2] since 2019 because of [CONDITION_1]. What happened? May I have your name and account number, ma’am? [NAME_1], [NAME_1].",
    "entities": [
      {
        "processed_text": "ORGANIZATION_1",
        "text": "Icarus Airways Customer Service",
        "location": {
          "stt_idx": 12,
          "end_idx": 43,
          "stt_idx_processed": 12,
          "end_idx_processed": 28
        },
        "best_label": "ORGANIZATION",
        "labels": {
          "ORGANIZATION": 0.8451
        }
      },
      {
        "processed_text": "ORGANIZATION_2",
        "text": "Icarus",
        "location": {
          "stt_idx": 134,
          "end_idx": 140,
          "stt_idx_processed": 119,
          "end_idx_processed": 135
        },
        "best_label": "ORGANIZATION",
        "labels": {
          "ORGANIZATION": 0.6382
        }
      },
      {
        "processed_text": "ORGANIZATION_2",
        "text": "Icarus",
        "location": {
          "stt_idx": 262,
          "end_idx": 268,
          "stt_idx_processed": 257,
          "end_idx_processed": 273
        },
        "best_label": "ORGANIZATION",
        "labels": {
          "ORGANIZATION": 0.6274
        }
      },
      {
        "processed_text": "CONDITION_1",
        "text": "COVID",
        "location": {
          "stt_idx": 291,
          "end_idx": 296,
          "stt_idx_processed": 296,
          "end_idx_processed": 309
        },
        "best_label": "CONDITION",
        "labels": {
          "CONDITION": 0.9187
        }
      },
      {
        "processed_text": "NAME_1",
        "text": "Nessa Jonsson",
        "location": {
          "stt_idx": 361,
          "end_idx": 374,
          "stt_idx_processed": 374,
          "end_idx_processed": 382
        },
        "best_label": "NAME",
        "labels": {
          "NAME_GIVEN": 0.3601,
          "NAME": 0.8899,
          "NAME_FAMILY": 0.5475
        }
      },
      {
        "processed_text": "NAME_1",
        "text": "N-E-S-S-A J-O-N-S-S-O-N",
        "location": {
          "stt_idx": 376,
          "end_idx": 399,
          "stt_idx_processed": 384,
          "end_idx_processed": 392
        },
        "best_label": "NAME",
        "labels": {
          "NAME_GIVEN": 0.3714,
          "NAME": 0.8863,
          "NAME_FAMILY": 0.53
        }
      }
    ],
    "entities_present": true,
    "characters_processed": 400,
    "languages_detected": {
      "en": 0.9167966246604919
    }
  }
]

Preset Entity Groups

If you need to comply with a specific legislation like HIPAA, the de-identification service makes it easy for you. You can simply choose from the list of preset entity groups:

['GDPR', 'GDPR_SENSITIVE', 'HIPAA_SAFE_HARBOR', 'CPRA', 'QUEBEC_PRIVACY_ACT', 'APPI', 'APPI_SENSITIVE', 'PCI', 'HEALTH_INFORMATION']

For details of what is contained in each group, please consult our entities page. Here is an example of how to enable all entities covered by the GDPR legislation:

Request Body

{
  "text": [
    "Hi! This is Icarus Airways Customer Service. How may I be of assistance to you? Hello! I’d like to complain about a discrepancy in my Icarus Frequent Voyager points. I checked my account today. I had 5,000 points in 2019. Now it’s just 3,500. I haven’t flown in Icarus since 2019 because of COVID. What happened? May I have your name and account number, ma’am? Nessa Jonsson, N-E-S-S-A J-O-N-S-S-O-N."
  ],
  "entity_detection": {
    "entity_types": [
      {
        "type": "ENABLE",
        "value": ["GDPR"]
      }
    ]
  }
}

In the response below, GDPR entities like NAME and CONDITION are redacted while ORGANIZATION mentions are not, since they are not part of GDPR:

 "Hi! This is Icarus Airways Customer Service. How may I be of assistance to you? Hello! I’d like to complain about a discrepancy in my Icarus Frequent Voyager points. I checked my account today. I had 5,000 points in 2019. Now it’s just 3,500. I haven’t flown in Icarus since 2019 because of [CONDITION_1]. What happened? May I have your name and account number, ma’am? [NAME_1], [NAME_1]."

[
  {
    "processed_text": "Hi! This is Icarus Airways Customer Service. How may I be of assistance to you? Hello! I’d like to complain about a discrepancy in my Icarus Frequent Voyager points. I checked my account today. I had 5,000 points in 2019. Now it’s just 3,500. I haven’t flown in Icarus since 2019 because of [CONDITION_1]. What happened? May I have your name and account number, ma’am? [NAME_1], [NAME_1].",
    "entities": [
      {
        "processed_text": "CONDITION_1",
        "text": "COVID",
        "location": {
          "stt_idx": 291,
          "end_idx": 296,
          "stt_idx_processed": 291,
          "end_idx_processed": 304
        },
        "best_label": "CONDITION",
        "labels": {
          "CONDITION": 0.9187
        }
      },
      {
        "processed_text": "NAME_1",
        "text": "Nessa Jonsson",
        "location": {
          "stt_idx": 361,
          "end_idx": 374,
          "stt_idx_processed": 369,
          "end_idx_processed": 377
        },
        "best_label": "NAME",
        "labels": {
          "NAME_GIVEN": 0.3601,
          "NAME": 0.8899,
          "NAME_FAMILY": 0.5475
        }
      },
      {
        "processed_text": "NAME_1",
        "text": "N-E-S-S-A J-O-N-S-S-O-N",
        "location": {
          "stt_idx": 376,
          "end_idx": 399,
          "stt_idx_processed": 379,
          "end_idx_processed": 387
        },
        "best_label": "NAME",
        "labels": {
          "NAME_GIVEN": 0.3714,
          "NAME": 0.8863,
          "NAME_FAMILY": 0.53
        }
      }
    ],
    "entities_present": true,
    "characters_processed": 400,
    "languages_detected": {
      "en": 0.9167966246604919
    }
  }
]

Security Considerations

There are several reasons that may motivate the use of selectors to limit the set of entities:

You are working with data from a specific domain which may not contain some of the supported entities. For example, medical data are unlikely to contain PCI information like credit card numbers. Disabling these entities will prevent potential false positives.
Some entities, while present in your data, may not be regarded as sensitive in your use case. For example, your data may contain generic URLs and filenames that can’t be used to identify individuals.

It is, however, important to understand that disabling entities may increase the risk that sensitive information is leaked. When selecting the list of entities to redact, we encourage you to take extra time and care to think through all the possible implications. A good practice is to have an expert validate that the redacted content is free of PII or other sensitive information. Following this validation, the list may be adjusted according to the findings.

Advanced Topics

This section presents advanced techniques to help you get the most out of the Limina service.

Combining Selectors

It is possible to combine selectors to help create the desired subset of entities. For example, you may be interested in complying with GDPR and HIPAA legislations. This can be done by listing these two groups in an ENABLE selector.

Request Body

{
  "text": [
    "Hi! This is Icarus Airways Customer Service. How may I be of assistance to you? Hello! I’d like to complain about a discrepancy in my Icarus Frequent Voyager points. I checked my account today. I had 5,000 points in 2019. Now it’s just 3,500. I haven’t flown in Icarus since 2019 because of COVID. What happened? May I have your name and account number, ma’am? Nessa Jonsson, N-E-S-S-A J-O-N-S-S-O-N."
  ],
  "entity_detection": {
    "entity_types": [
      {
        "type": "ENABLE",
        "value": ["GDPR", "HIPAA_SAFE_HARBOR"]
      }
    ]
  }
}

You can also mix preset groups and individual entity types in the same selector.

Request Body

{
  "text": [
    "Hi! This is Icarus Airways Customer Service. How may I be of assistance to you? Hello! I’d like to complain about a discrepancy in my Icarus Frequent Voyager points. I checked my account today. I had 5,000 points in 2019. Now it’s just 3,500. I haven’t flown in Icarus since 2019 because of COVID. What happened? May I have your name and account number, ma’am? Nessa Jonsson, N-E-S-S-A J-O-N-S-S-O-N."
  ],
  "entity_detection": {
    "entity_types": [
      {
        "type": "ENABLE",
        "value": ["GDPR", "ORGANIZATION"]
      }
    ]
  }
}

The above request will redact all GDPR entities as well as ORGANIZATION. It is also possible to combine ENABLE and DISABLE selectors.

Request Body

{
  "text": [
    "Hi! This is Icarus Airways Customer Service. How may I be of assistance to you? Hello! I’d like to complain about a discrepancy in my Icarus Frequent Voyager points. I checked my account today. I had 5,000 points in 2019. Now it’s just 3,500. I haven’t flown in Icarus since 2019 because of COVID. What happened? May I have your name and account number, ma’am? Nessa Jonsson, N-E-S-S-A J-O-N-S-S-O-N."
  ],
  "entity_detection": {
    "entity_types": [
      {
        "type": "ENABLE",
        "value": ["GDPR"]
      },
      {
        "type": "DISABLE",
        "value": ["DATE", "DATE_INTERVAL"]
      }
    ]
  }
}

This request enables all GDPR entities except DATE and DATE_INTERVAL.
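
The selector requests in this section all share one shape, so they can be assembled programmatically. The following is a minimal Python sketch; the helper name build_process_text_body is hypothetical and is not part of Limina's Python client. It only builds the JSON body shown in this guide and does not send it.

```python
import json


def build_process_text_body(texts, enable=None, disable=None):
    """Build a Process Text request body with optional entity selectors.

    Hypothetical helper for illustration; it mirrors the JSON bodies in
    this guide and is not part of Limina's Python client.
    """
    selectors = []
    if enable:
        selectors.append({"type": "ENABLE", "value": list(enable)})
    if disable:
        selectors.append({"type": "DISABLE", "value": list(disable)})

    body = {"text": list(texts)}
    if selectors:
        body["entity_detection"] = {"entity_types": selectors}
    return body


body = build_process_text_body(
    ["I checked my account today. I had 5,000 points in 2019."],
    enable=["GDPR"],
    disable=["DATE", "DATE_INTERVAL"],
)
print(json.dumps(body, indent=2))
```

Keeping the body construction separate from the HTTP call makes it easy to inspect or log the exact selectors before any data leaves your system.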

Selector Precedence

When combining several selectors, the list of enabled entities is first computed by expanding all entity types and groups in the ENABLE selectors. The entity types and groups listed in DISABLE selectors are then expanded and removed from that list to form the final list of entities.

When no ENABLE selectors are specified, it is assumed that all supported entities are enabled. In this case, DISABLE selectors remove entries from the list of all supported entities.
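
The precedence rule can be sketched with plain set operations. The group contents and the list of supported entities below are made-up placeholders (the real groups are listed on the entities page), and resolve_selectors is a hypothetical name. The sketch only mirrors the order of operations described here, not the service's implementation.

```python
# Placeholder data for illustration only; not the real group definitions.
ALL_SUPPORTED = {"NAME", "ORGANIZATION", "DATE", "DATE_INTERVAL", "CONDITION"}
GROUPS = {"EXAMPLE_GROUP": {"NAME", "DATE", "DATE_INTERVAL", "CONDITION"}}


def expand(values):
    """Expand group names into their member entity types."""
    expanded = set()
    for value in values:
        expanded |= GROUPS.get(value, {value})
    return expanded


def resolve_selectors(selectors):
    """Apply ENABLE selectors first, then remove everything in DISABLE."""
    enabled, disabled = set(), set()
    has_enable = False
    for selector in selectors:
        if selector["type"] == "ENABLE":
            has_enable = True
            enabled |= expand(selector["value"])
        elif selector["type"] == "DISABLE":
            disabled |= expand(selector["value"])
    if not has_enable:
        # No ENABLE selector: start from every supported entity type.
        enabled = set(ALL_SUPPORTED)
    return enabled - disabled


final = resolve_selectors([
    {"type": "ENABLE", "value": ["EXAMPLE_GROUP"]},
    {"type": "DISABLE", "value": ["DATE", "DATE_INTERVAL"]},
])
print(sorted(final))
# ['CONDITION', 'NAME']
```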

Selective redaction

If you want to redact only one or two of the 50+ entity types that we support, you can use the ENABLE selector. For example, to redact only NAME_FAMILY and LOCATION_ADDRESS_STREET in the text below, use the ENABLE selector and specify these two entities:

Request Body

{
  "text": [
    "Hello there! I am Alice McGee, residing at 325 Sophia St, Port Coquitlam. You can reach me at 235 123-9876 or via email at alicemcgee@gmail.com. I work at SFU."
  ], 
  "entity_detection": {
    "entity_types": [
      {
        "type": "ENABLE",
        "value": ["NAME_FAMILY", "LOCATION_ADDRESS_STREET"]
      }
    ]
  }
}

The above request will keep NAME_GIVEN, LOCATION_CITY, PHONE_NUMBER, EMAIL_ADDRESS and ORGANIZATION visible and redact only NAME_FAMILY and LOCATION_ADDRESS_STREET. In other cases, you may want to redact only PCI entities and keep all others unredacted, such as ACCOUNT_NUMBER in the following example:

Request Body

{
  "text": [
    "His account number at WaveNow Digital is 6787655, and it is connected to his bank account at First National Bank, 987654321. But he also has a credit card on file, ending with 7876, expires on 10/25, and his billing address is 456 41st Avenue, Lower Valley."
  ], 
  "entity_detection": {
    "entity_types": [
      {
        "type": "ENABLE",
        "value": ["PCI"]
      }
    ],
    "enable_non_max_suppression": true
  }
}

In this example, BANK_ACCOUNT, CREDIT_CARD, CREDIT_CARD_EXPIRATION and CVV will be redacted, while ACCOUNT_NUMBER and LOCATION_ADDRESS will be visible.

Understanding Multi-Label Predictions

At the core of the Limina service lies a model that performs named entity recognition (NER). In essence, NER models seek to classify each word in a text into a fixed set of classes: the entity types. NER is often framed as a multi-class multi-label problem since a single word can have several labels. For example, in Simon Fraser University the word Simon is both part of a name (i.e. Simon Fraser) and part of an organization (i.e. Simon Fraser University). To create the de-identified or redacted output, the service must select, among the predicted labels, the one that best represents the entity. To do so, it will often prefer longer entities over shorter ones. For example, the service will redact Simon Fraser University as an ORGANIZATION:

Example

I study at Simon Fraser University -> I study at [ORGANIZATION]

instead of a combination of a NAME and an ORGANIZATION:

Example

I study at Simon Fraser University -> I study at [NAME] [ORGANIZATION]

This leads to a much more natural output for users. Disabling entities has a direct impact on that behaviour. Let’s suppose that ORGANIZATION has been disabled when redacting the above text.

Request Body

{
  "text": [
    "I study at Simon Fraser University"
  ],
  "entity_detection": {
    "entity_types": [
      {
        "type": "DISABLE",
        "value": ["ORGANIZATION"]
      }
    ]
  }
}

The resulting response might be surprising at first.

"I study at [NAME_1] University"

[
  {
    "processed_text": "I study at [NAME_1] University",
    "entities": [
      {
        "processed_text": "NAME_1",
        "text": "Simon Fraser",
        "location": {
          "stt_idx": 11,
          "end_idx": 23,
          "stt_idx_processed": 11,
          "end_idx_processed": 19
        },
        "best_label": "NAME",
        "labels": {
          "NAME_GIVEN": 0.2466,
          "NAME": 0.4709,
          "NAME_FAMILY": 0.2037
        }
      }
    ],
    "entities_present": true,
    "characters_processed": 34,
    "languages_detected": {
      "en": 0.8937596678733826
    }
  }
]

Although ORGANIZATION was disabled, part of Simon Fraser University was redacted. This is because Simon Fraser is part of an organization name but is also a person name. Given that ORGANIZATION was disabled, the de-identification service picked the second best label for these words, which is NAME. Depending on your use case, you may prefer to keep the full organization name in the output. This can be done with the enable_non_max_suppression flag.

Request Body

{
  "text": [
    "I study at Simon Fraser University"
  ],
  "entity_detection": {
    "entity_types": [
      {
        "type": "DISABLE",
        "value": ["ORGANIZATION"]
      }
    ],
    "enable_non_max_suppression": true
  }
}

When the enable_non_max_suppression flag is set to true, the service will ignore labels with lower likelihoods (i.e. NAME in the above example) therefore preventing the redaction of the ORGANIZATION as shown below.

I study at Simon Fraser University

[
  {
    "processed_text": "I study at Simon Fraser University",
    "entities": [],
    "entities_present": false,
    "characters_processed": 34,
    "languages_detected": {
      "en": 0.8937596678733826
    }
  }
]

Note that, as with disabled entities, the enable_non_max_suppression flag should be used cautiously. Setting this flag to true may increase the chance of leaking sensitive information.
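
To make the effect of the flag concrete, here is a toy Python model of the label choice for a single span. It is a simplification based only on the behaviour described in this section, not the service's actual algorithm. The function name and the ORGANIZATION score are invented for the example; the NAME score is taken from the response shown earlier.

```python
def choose_label(labels, disabled, non_max_suppression=False):
    """Pick the label used for redaction, or None to leave the span as is.

    Toy model: `labels` maps entity types to likelihoods for one span.
    """
    ranked = sorted(labels, key=labels.get, reverse=True)
    if non_max_suppression:
        # Only the most likely label is considered; lower-scoring
        # alternatives are ignored even if they are enabled.
        ranked = ranked[:1]
    for label in ranked:
        if label not in disabled:
            return label
    return None


# Scores for "Simon Fraser"; the ORGANIZATION value is made up.
scores = {"ORGANIZATION": 0.90, "NAME": 0.4709}

print(choose_label(scores, disabled={"ORGANIZATION"}))
# NAME
print(choose_label(scores, disabled={"ORGANIZATION"}, non_max_suppression=True))
# None
```

In the first call the span falls back to NAME and is redacted; in the second the fallback is suppressed, so the span stays in the text.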

Understanding Hierarchical Types

Some of the supported entities in the de-identification service are structured into hierarchies. The labels NAME and LOCATION are good examples. The NAME hierarchy includes NAME_GIVEN and NAME_FAMILY but also NAME_MEDICAL_PROFESSIONAL, while the LOCATION hierarchy contains LOCATION_COUNTRY, LOCATION_STATE, LOCATION_CITY and so on. However, most labels supported by Limina are not part of a hierarchy.

The diagram below shows the different hierarchies. Hierarchies are organized from the most general entity types to the most specific ones. In the diagram, the general types are on the left (level 1). As one moves to the right, the entity types become more specific (level 2 and level 3). The diagram also shows whether the label is a direct identifier, a quasi identifier, or neither (i.e., other), and which of the labels are still in beta.

When creating the redacted text, the de-identification service will prefer to use the most specific label in a hierarchy instead of the root label. For example, I live in Canada will be redacted as I live in [LOCATION_COUNTRY] instead of the more generic I live in [LOCATION]. This behaviour improves the usability of the data. If you don’t want this level of granularity, you can use the information in the diagram to map more specific labels (e.g., LOCATION_CITY) to more generic ones (e.g., LOCATION). This can be done as a post-processing step on the Limina API output.
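
A post-processing step of this kind might look like the Python sketch below. It assumes that a specific label always starts with its root label followed by an underscore, which holds for the NAME and LOCATION examples named in this section but should be checked against the hierarchy diagram for other types. The ROOT_LABELS tuple and the function name are illustrative.

```python
# Root labels named in this guide; extend after checking the hierarchy.
ROOT_LABELS = ("NAME", "LOCATION")


def generalize_label(label):
    """Map a specific label such as LOCATION_CITY to its root label."""
    for root in ROOT_LABELS:
        if label == root or label.startswith(root + "_"):
            return root
    return label  # Not part of a known hierarchy: keep unchanged.


entities = [
    {"text": "Canada", "best_label": "LOCATION_COUNTRY"},
    {"text": "Nessa", "best_label": "NAME_GIVEN"},
    {"text": "COVID", "best_label": "CONDITION"},
]
generalized = [
    {**entity, "best_label": generalize_label(entity["best_label"])}
    for entity in entities
]
print([entity["best_label"] for entity in generalized])
# ['LOCATION', 'NAME', 'CONDITION']
```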

Filters

This guide assumes that you have a working knowledge of regular expressions. Limina uses the Python regular expression syntax. You can find more details in the Python re module documentation.

Sometimes referred to as whitelists and blacklists, filters are specifically designed to allow entities (i.e. leave them in the text) or block entities (i.e. redact them from the text) when the entity text follows an expected format. Filters are built using regular expressions.

An example Python script showing how to use filters with Limina’s Python client can be found here.

Allow Filter

How can you redact regular phone numbers while keeping companies’ toll-free numbers in the clear? How would you prevent document ID numbers from being detected as sensitive numbers? These are two examples of use cases that can be addressed with Allow filters. Allow filters instruct the detection engine to ignore entities when the entity text matches a specific pattern. To create an Allow filter, first write a regular expression pattern, then add it to the filter list in the entity_detection object of your REST request. Let’s look at a couple of examples.

Allow List

It is possible to feed lists of terms as well as regex patterns to filters. For example, if you want to prevent the detection engine from removing country names you can whitelist them with:

Request Body Entity Detection

"entity_detection": {
  "filter": [
    {
      "type": "ALLOW",
      "pattern": "Canada|Brazil|Italy"
    }
  ]
}
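
When the list of terms is long or contains characters that are special in regular expressions, it can be safer to build the pattern programmatically. This Python sketch uses re.escape from the standard library; the helper name is hypothetical. Whether the service matches the pattern against the whole entity text or only part of it is not stated here, so the local check uses re.fullmatch purely as an illustration.

```python
import re


def terms_to_pattern(terms):
    """Join literal terms into one alternation, escaping special characters."""
    return "|".join(re.escape(term) for term in terms)


pattern = terms_to_pattern(["Canada", "Brazil", "Italy", "St. Lucia"])
allow_filter = {"type": "ALLOW", "pattern": pattern}

print(allow_filter["pattern"])
# The escaped dot only matches a literal "." so "St, Lucia" is rejected.
print(bool(re.fullmatch(pattern, "St. Lucia")))
print(bool(re.fullmatch(pattern, "St, Lucia")))
```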

Allowing toll-free numbers

This is an example of a process/text request containing an Allow filter. When run, this request detects and redacts phone numbers unless they follow the specific format for toll-free numbers:

Request Body

{
  "text": [
    "Call me at 438-555-7343 or at work at 1-800-555-1423"
  ],
  "entity_detection": {
    "filter": [
      {
        "type": "ALLOW",
        "pattern": "(1-)?(800|888|877|866|855|844|833)-\\d{3}-\\d{4}$"
      }
    ]
  }
}

This gives the following:

Call me at [PHONE_NUMBER_1] or at work at 1-800-555-1423

[
  {
    "processed_text": "Call me at [PHONE_NUMBER_1] or at work at 1-800-555-1423",
    "entities": [
      {
        "processed_text": "PHONE_NUMBER_1",
        "text": "438-555-7343",
        "location": {
          "stt_idx": 11,
          "end_idx": 23,
          "stt_idx_processed": 11,
          "end_idx_processed": 27
        },
        "best_label": "PHONE_NUMBER",
        "labels": {
          "PHONE_NUMBER": 0.9093
        }
      }
    ],
    "entities_present": true,
    "characters_processed": 52,
    "languages_detected": {
      "en": 0.695495069026947
    }
  }
]

As expected, the first phone number is redacted while the second toll-free number is left in the processed text.
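
Before sending a filter, it can help to try the pattern locally with Python's re module, since the service uses the same syntax. The check below assumes the pattern is applied to the text of each detected entity; re.search is used here, and the exact matching mode used by the service is an assumption.

```python
import re

# Same pattern as in the request, written as a raw Python string.
TOLL_FREE = r"(1-)?(800|888|877|866|855|844|833)-\d{3}-\d{4}$"

for phone in ["438-555-7343", "1-800-555-1423"]:
    allowed = re.search(TOLL_FREE, phone) is not None
    print(phone, "-> kept" if allowed else "-> redacted")
# 438-555-7343 -> redacted
# 1-800-555-1423 -> kept
```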

Escaping Regular Expressions

Many regular expressions contain backslashes \ and other special characters. It is important to note that the backslash \ is a reserved character in JSON. As such, backslashes in strings must be escaped as \\ to retain their original meaning. As demonstrated in the example above, the regular expression r"(1-)?(800|888|877|866|855|844|833)-\d{3}-\d{4}$" was escaped to "(1-)?(800|888|877|866|855|844|833)-\\d{3}-\\d{4}$" in the JSON request body.
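
If the request body is built in Python, the escaping does not have to be done by hand: serializing with the standard json module doubles the backslashes automatically. A small demonstration, using the toll-free pattern from the example above:

```python
import json

pattern = r"(1-)?(800|888|877|866|855|844|833)-\d{3}-\d{4}$"
body = {"entity_detection": {"filter": [{"type": "ALLOW", "pattern": pattern}]}}

# In the serialized JSON text, each backslash in the pattern is doubled.
encoded = json.dumps(body)
print(encoded)

# Decoding restores the original single-backslash pattern.
decoded = json.loads(encoded)
restored = decoded["entity_detection"]["filter"][0]["pattern"]
print(restored == pattern)
# True
```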

Allowing IDs

Let’s look at a different example. Suppose that you are de-identifying contracts of the form:

Example Text

CCT-2022-09-12321: Contract between John Doe and Acme Corp.

THIS AGREEMENT is made ...

It might be difficult for the detection engine to determine if the ID CCT-2022-09-12321 in the document header is sensitive. The sensitivity may depend for example on other information being publicly available. In this case, the detection engine will flag the IDs. However, if you know that these numbers are not sensitive you may prefer to instruct the detection engine to allow such entities:

Request Body

{
  "text": [
    "CCT-2022-09-12321: Contract between John Doe and Acme Corp.\n\nTHIS AGREEMENT is made ..."
  ],
  "entity_detection": {
    "filter": [
      {
        "type": "ALLOW",
        "pattern": "CCT-\\d{4}-\\d{2}-\\d+"
      }
    ]
  }
}

This is the process/text response to the above request:

CCT-2022-09-12321: Contract between [NAME_1] and [ORGANIZATION_1].\n\nTHIS AGREEMENT is made ...

[
  {
    "processed_text": "CCT-2022-09-12321: Contract between [NAME_1] and [ORGANIZATION_1].\n\nTHIS AGREEMENT is made ...",
    "entities": [
      {
        "processed_text": "NAME_1",
        "text": "John Doe",
        "location": {
          "stt_idx": 36,
          "end_idx": 44,
          "stt_idx_processed": 36,
          "end_idx_processed": 44
        },
        "best_label": "NAME",
        "labels": {
          "NAME": 0.9287,
          "NAME_GIVEN": 0.3926,
          "NAME_FAMILY": 0.2851
        }
      },
      {
        "processed_text": "ORGANIZATION_1",
        "text": "Acme Corp",
        "location": {
          "stt_idx": 49,
          "end_idx": 58,
          "stt_idx_processed": 49,
          "end_idx_processed": 65
        },
        "best_label": "ORGANIZATION",
        "labels": {
          "ORGANIZATION": 0.885
        }
      }
    ],
    "entities_present": true,
    "characters_processed": 89,
    "languages_detected": {
      "en": 0.9532792568206787
    }
  }
]

Without the Allow filter above, the ID CCT-2022-09-12321 would be redacted as NUMERICAL_PII but, as expected, it was not redacted thanks to the Allow filter.

Block Filter

Let’s say that you want to detect some codes or IDs sharing a common format in your data. You can rely on the de-identification service to perform the redaction for you, but it may sometimes be preferable to create your own detection logic and provide a specific label for these entities. This is exactly what Block filters are for.

Block List

Similar to the allow list or whitelist, you can create a block list or blacklist to ensure that some common keywords are always detected and removed like so:

Request Body Entity Detection

"entity_detection": {
  "filter": [
    {
      "type": "BLOCK",
      "pattern": "Android|iPhone|Pixel",
      "entity_type": "CELL_TYPE"
    }
  ]
}
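
Filter entries can also be assembled with a small helper that checks each pattern compiles before it is sent. This is a Python sketch with a hypothetical function name. In the examples in this guide, BLOCK filters carry an entity_type and ALLOW filters do not; the helper enforces that convention, which is an assumption about the API rather than a documented rule.

```python
import re


def make_filter(filter_type, pattern, entity_type=None):
    """Build one filter entry, checking that the pattern compiles first."""
    re.compile(pattern)  # raises re.error on an invalid pattern
    entry = {"type": filter_type, "pattern": pattern}
    if filter_type == "BLOCK":
        if not entity_type:
            raise ValueError("BLOCK filters in this guide set an entity_type")
        entry["entity_type"] = entity_type
    return entry


allow_detection = {"filter": [make_filter("ALLOW", "Canada|Brazil|Italy")]}
block_detection = {
    "filter": [make_filter("BLOCK", "Android|iPhone|Pixel", entity_type="CELL_TYPE")]
}
print(allow_detection)
print(block_detection)
```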

Blocking IDs

Let’s look at our contract example above. With the help of block filters, you can redact the contract id as CONTRACT_ID in the document above:

Request Body

{
  "text": [
    "CCT-2022-09-12321: Contract between John Doe and Acme Corp.\n\nTHIS AGREEMENT is made ..."
  ],
  "entity_detection": {
    "filter": [
      {
        "type": "BLOCK",
        "pattern": "CCT-\\d{4}-\\d{2}-\\d+",
        "entity_type": "CONTRACT_ID"
      }
    ]
  }
}

This is the process/text response to the above request:

[CONTRACT_ID_1]: Contract between [NAME_1] and [ORGANIZATION_1].\n\nTHIS AGREEMENT is made ...

[
  {
    "processed_text": "[CONTRACT_ID_1]: Contract between [NAME_1] and [ORGANIZATION_1].\n\nTHIS AGREEMENT is made ...",
    "entities": [
      {
        "processed_text": "CONTRACT_ID_1",
        "text": "CCT-2022-09-12321",
        "location": {
          "stt_idx": 0,
          "end_idx": 17,
          "stt_idx_processed": 0,
          "end_idx_processed": 15
        },
        "best_label": "CONTRACT_ID",
        "labels": {
          "CONTRACT_ID": 1
        }
      },
      {
        "processed_text": "NAME_1",
        "text": "John Doe",
        "location": {
          "stt_idx": 36,
          "end_idx": 44,
          "stt_idx_processed": 34,
          "end_idx_processed": 42
        },
        "best_label": "NAME",
        "labels": {
          "NAME": 0.9203,
          "NAME_GIVEN": 0.3381,
          "NAME_FAMILY": 0.1802
        }
      },
      {
        "processed_text": "ORGANIZATION_1",
        "text": "Acme Corp",
        "location": {
          "stt_idx": 49,
          "end_idx": 58,
          "stt_idx_processed": 47,
          "end_idx_processed": 63
        },
        "best_label": "ORGANIZATION",
        "labels": {
          "ORGANIZATION": 0.7899
        }
      }
    ],
    "entities_present": true,
    "characters_processed": 87,
    "languages_detected": {
      "en": 0.9520388245582581
    }
  }
]

As expected, the contract ID in the text has been redacted with our own custom marker. Here is another example with a more complex pattern to match.

Augmenting an existing entity type

This is provided as an example and not as a complete solution to redact all ICD numbers.

In this example, we are detecting ICD-10 numbers and adding these entities to the existing CONDITION entity type:

Request Body

{
  "text": [
    "ICD-10 References\nJ18.9 | Pneumonia\nE11.52 | Type 2 diabetes mellitus with certain circulatory complications"
  ],
  "entity_detection": {
    "filter": [
      {
        "type": "BLOCK",
        "pattern": "(?i)([a-t]|[v-z])\\d[a-z0-9](\\.[a-z0-9]{1,4})?",
        "entity_type": "CONDITION"
      }
    ]
  }
}

This is the process/text response to the above request:

ICD-10 References\n[CONDITION_1] | [CONDITION_2]\n[CONDITION_3] | [CONDITION_4] with certain circulatory complications

[
  {
    "processed_text": "ICD-10 References\n[CONDITION_1] | [CONDITION_2]\n[CONDITION_3] | [CONDITION_4] with certain circulatory complications",
    "entities": [
      {
        "processed_text": "CONDITION_1",
        "text": "J18.9",
        "location": {
          "stt_idx": 18,
          "end_idx": 23,
          "stt_idx_processed": 18,
          "end_idx_processed": 31
        },
        "best_label": "CONDITION",
        "labels": {
          "CONDITION": 1
        }
      },
      {
        "processed_text": "CONDITION_2",
        "text": "Pneumonia",
        "location": {
          "stt_idx": 26,
          "end_idx": 35,
          "stt_idx_processed": 34,
          "end_idx_processed": 47
        },
        "best_label": "CONDITION",
        "labels": {
          "CONDITION": 0.8982
        }
      },
      {
        "processed_text": "CONDITION_3",
        "text": "E11.52",
        "location": {
          "stt_idx": 36,
          "end_idx": 42,
          "stt_idx_processed": 48,
          "end_idx_processed": 61
        },
        "best_label": "CONDITION",
        "labels": {
          "CONDITION": 1
        }
      },
      {
        "processed_text": "CONDITION_4",
        "text": "Type 2 diabetes mellitus",
        "location": {
          "stt_idx": 45,
          "end_idx": 69,
          "stt_idx_processed": 64,
          "end_idx_processed": 77
        },
        "best_label": "CONDITION",
        "labels": {
          "CONDITION": 0.9196
        }
      }
    ],
    "entities_present": true,
    "characters_processed": 108,
    "languages_detected": {
      "en": 0.5311049818992615
    }
  }
]

You can see that the results from the block filter and from the detection engine have been combined to create a more comprehensive CONDITION entity type.
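Before sending a block filter like this one, it can help to try the pattern locally. The following is a minimal sketch using Python's re package on the sample text above; it only illustrates what the regular expression itself matches and does not call the Limina API or reproduce the detection engine.

```python
import re

# The pattern from the request above, after JSON unescaping (\\ becomes \).
ICD10_PATTERN = re.compile(r"(?i)([a-t]|[v-z])\d[a-z0-9](\.[a-z0-9]{1,4})?")

text = (
    "ICD-10 References\n"
    "J18.9 | Pneumonia\n"
    "E11.52 | Type 2 diabetes mellitus with certain circulatory complications"
)

# finditer yields one match object per hit; group(0) is the full matched text.
matches = [(m.group(0), m.start(), m.end()) for m in ICD10_PATTERN.finditer(text)]

for value, start, end in matches:
    print(f"{value!r} at [{start}, {end})")
# 'J18.9' at [18, 23)
# 'E11.52' at [36, 42)
```

The two offsets agree with the stt_idx and end_idx values reported for CONDITION_1 and CONDITION_3 in the response, which are the entities contributed by the block filter.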

Allow Text Filter (new in 3.7)

Allow text filters are similar to Allow filters but instead of allowing individual entities, they “mark” sections of your document as safe so that no entities are detected and nothing is redacted or de-identified. Let’s consider a simple example.

Allowing a section of a document

Suppose that you have a document which contains a References section with public information only:

Example Text

Conclusion
A section with sensitive information like name (e.g. John Doe) and organization (e.g. Acme Corp).

References
Berfin Aktaş, Veronika Solopova, Annalena Kohnert, and Manfred Stede. 2020. Adapting Coreference Resolution to Twitter Conversations. In Findings of EMNLP.
Rahul Aralikatte, Heather Lent, Ana Valeria Gonzalez, Daniel Herschcovich, Chen Qiu, Anders Sandholm, Michael Ringaard, and Anders Søgaard. 2019. Rewarding Coreference Resolvers for Being Consistent with World Knowledge. In EMNLP-IJCNLP.

By default, this document would be redacted as:

Example Processed Text

Conclusion
A section with sensitive information like name (e.g. [NAME_1]) and organization (e.g. [ORGANIZATION_1]).

References
[NAME_2], [NAME_3], [NAME_4],and [NAME_5]. [DATE_INTERVAL_1]. Adapting Coreference Resolution to Twitter Conversations. In Findings of EMNLP.
[NAME_6], [NAME_7], [NAME_8],[NAME_9], [NAME_10], [NAME_11],[NAME_12], and [NAME_13]. [DATE_INTERVAL_2]. Rewarding Coreference Resolvers for Being Consistent with World Knowledge. In EMNLP-IJCNLP.

But you may prefer to not de-identify the References section since it is not sensitive. This could be done with the Allow Text filter (keeping only the filter in the request for readability):

Request Body

{
  "text": ["..."],
  "entity_detection": {
    "filter": [
      {
        "type": "ALLOW_TEXT",
        "pattern": "References\\s+([\\S\\s]+)"
      }
    ]
  }
}

Which would result in this processed text:

Example Processed Text

Conclusion
A section with sensitive information like name (e.g. [NAME_1]) and organization (e.g. [ORGANIZATION_1]).

References
Berfin Aktaş, Veronika Solopova, Annalena Kohnert, and Manfred Stede. 2020. Adapting Coreference Resolution to Twitter Conversations. In Findings of EMNLP.
Rahul Aralikatte, Heather Lent, Ana Valeria Gonzalez, Daniel Herschcovich, Chen Qiu, Anders Sandholm, Michael Ringaard, and Anders Søgaard. 2019. Rewarding Coreference Resolvers for Being Consistent with World Knowledge. In EMNLP-IJCNLP.

where the References section was not de-identified. Allow Text filters also support capturing groups in regular expressions.
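A quick local check shows how far this pattern reaches. The sketch below uses Python's re package on a shortened stand-in document; the Acknowledgements section and its contents are hypothetical and added only for illustration. Because [\S\s]+ is greedy, it runs to the end of the text, so anything that follows the References heading falls inside the allowed region.

```python
import re

# ALLOW_TEXT pattern from the request above, after JSON unescaping.
pattern = re.compile(r"References\s+([\S\s]+)")

# Shortened stand-in document; the trailing section is a hypothetical addition.
document = (
    "Conclusion\n"
    "A section mentioning John Doe and Acme Corp.\n"
    "\n"
    "References\n"
    "A. Author. 2020. A publicly available paper.\n"
    "\n"
    "Acknowledgements\n"
    "Thanks to Jane Roe for her review.\n"
)

match = pattern.search(document)
allowed = match.group(1)        # text selected by the capturing group
start, end = match.span(1)      # character offsets of that text

print(repr(allowed))
print("reaches end of document:", end == len(document))
print("'Jane Roe' inside allowed region:", "Jane Roe" in allowed)
print("'John Doe' inside allowed region:", "John Doe" in allowed)
```

If a document can contain sensitive sections after the one you want to allow, a tighter pattern that stops at the next heading may be a safer choice.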

Using capturing groups

Capturing groups are a very useful feature of regular expressions. By adding capturing groups to your regular expression, you can effectively dissect a matched text into the sections of interest. Consider this document including an audit trail with the editor name and the date of the changes:

Example Text

[Part 1] [John Doe: Fri Mar 10 16:09:20 GMT 2023]
[Conclusion] [John Hancock: March 14, 2023]

Let’s say you want to de-identify the author name but keep the dates of the audit trail in your processed text. One approach is to use Allow filters. However, it might be difficult to create a proper regular expression to allow all possible date formats. Moreover, all date entities would be allowed and not only those in the audit trail. This is where Allow Text filters and capturing groups become useful. The following request contains an Allow Text filter for the audit trail above:

Request Body

{
  "text": [
    "[Part 1] [John Doe: Fri Mar 10 16:09:20 GMT 2023]\n[Conclusion] [John Hancock: March 14, 2023]"
  ],
  "entity_detection": {
    "filter": [
      {
        "type": "ALLOW_TEXT",
        "pattern": "\\[[^:]*:([^\\]]*)\\]"
      }
    ]
  }
}

Notice the capturing group ([^\]]*) in the second part of the pattern. This group is selecting the date, that is, the section of text from the colon : up to the closing square bracket ]. This informs the Allow Text filter that only this section has to be allowed. This produces this processed text:

Example Processed Text

[Part 1] [[NAME_1]: Fri Mar 10 16:09:20 GMT 2023]
[Conclusion] [[NAME_2]: March 14, 2023]

where names are masked but dates are shown. When groups are present, Allow Text filters will only allow the text matching the groups. This provides the flexibility you need to allow the section of text you want.
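To see which text the capturing group selects, the pattern can be run locally with Python's re package. This is a small sketch of the regular expression's behavior only, not of the Limina filter implementation.

```python
import re

# ALLOW_TEXT pattern from the request above, after JSON unescaping.
pattern = re.compile(r"\[[^:]*:([^\]]*)\]")

text = (
    "[Part 1] [John Doe: Fri Mar 10 16:09:20 GMT 2023]\n"
    "[Conclusion] [John Hancock: March 14, 2023]"
)

# span(1) is the region selected by the capturing group in each match.
allowed_spans = [m.span(1) for m in pattern.finditer(text)]
allowed_texts = [text[start:end] for start, end in allowed_spans]

print(allowed_texts)
# [' Fri Mar 10 16:09:20 GMT 2023', ' March 14, 2023']
```

The full match runs from the first opening bracket of each line to the closing bracket after the date, but only the group (the text after the first colon) is selected, which is why the author names stay outside the allowed region.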

A word of caution

The regular expression pattern in a filter can be as complex as it needs to be to capture the specific text of interest. However, be careful not to write patterns that are too generic: they risk de-identifying sections of your document unnecessarily or, worse, leaving sensitive information unredacted.

Quick summary

Limina uses the re Python package to evaluate regular expressions. As we have seen so far, the difference between ALLOW and ALLOW_TEXT can be summarized as follows:

ALLOW checks whether a regex pattern matches a detected entity value. If the entity text matches the pattern, the entity is ignored and not redacted.
ALLOW_TEXT checks whether detected entities are part of any text fragment that the provided regex pattern matches in the document (or, when the pattern contains capturing groups, that the groups match). If the entity text is part of such a fragment, the entity is ignored and not redacted.
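The contrast can be sketched in a few lines of Python. The two helper functions below are hypothetical and are not part of Limina's API or its actual implementation; in particular, using re.search for ALLOW and considering only the first capturing group for ALLOW_TEXT are simplifying assumptions made for this illustration.

```python
import re

def allow_ignores(entity_text: str, pattern: str) -> bool:
    """ALLOW-style check (assumed): test the pattern against the entity value."""
    return re.search(pattern, entity_text) is not None

def allow_text_ignores(document: str, entity_span: tuple, pattern: str) -> bool:
    """ALLOW_TEXT-style check (assumed): the entity must lie inside a fragment
    matched in the document. With capturing groups, only group 1 counts here."""
    regex = re.compile(pattern)
    group = 1 if regex.groups else 0
    start, end = entity_span
    return any(
        m.start(group) <= start and end <= m.end(group)
        for m in regex.finditer(document)
    )

def span_of(document: str, value: str) -> tuple:
    """Locate a value in the document, standing in for a detected entity."""
    start = document.index(value)
    return (start, start + len(value))

document = "[Part 1] [John Doe: March 10, 2023]\nMeeting held on March 12, 2023."
date_pattern = r"^March \d{1,2}, \d{4}$"
audit_pattern = r"\[[^:]*:([^\]]*)\]"

# ALLOW looks only at the entity value, so both dates would be kept.
both_dates_allowed = [allow_ignores(d, date_pattern)
                      for d in ("March 10, 2023", "March 12, 2023")]

# ALLOW_TEXT looks at position, so only the date in the audit trail is kept.
trail_date = allow_text_ignores(document, span_of(document, "March 10, 2023"), audit_pattern)
other_date = allow_text_ignores(document, span_of(document, "March 12, 2023"), audit_pattern)
author = allow_text_ignores(document, span_of(document, "John Doe"), audit_pattern)

print(both_dates_allowed, trail_date, other_date, author)
```

In this sketch the date inside the audit trail is ignored by the ALLOW_TEXT-style check, while the date in the running text and the author name remain candidates for redaction.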
