- Part 1: Enabling and Disabling Entity Types covers enabling and disabling the types of entities that are detected.
- Part 2: Filters covers allow & block list functionality and including regexes.
Enabling and Disabling Entity Types
Limina detects over 50 unique entity types, covering personal, credit card and medical information. By default, all non-beta supported types are detected, but this can be customized via entity selectors. If you need to comply with existing legislation such as GDPR or HIPAA, you may want to de-identify only the entities covered by that regulation. This can be done with preset entity groups. Alternatively, you may prefer to detect your own set of entities, which can also be done using entity selectors.
An example Python script showing how to use entity selectors with Limina’s Python client can be found here.
Configuring Entity Selectors
Entity selectors let you enable or disable entity types as part of your API request. You can, for example, enable NAME and ORGANIZATION while ignoring all other entity types by using the ENABLE selector, as shown in this request:
Request Body
DISABLE selector:
Request Body
Preset Entity Groups
If you need to comply with specific legislation such as HIPAA, you can choose from the list of preset entity groups: ['GDPR', 'GDPR_SENSITIVE', 'HIPAA_SAFE_HARBOR', 'CPRA', 'QUEBEC_PRIVACY_ACT', 'APPI', 'APPI_SENSITIVE', 'PCI', 'HEALTH_INFORMATION']. For details of what is contained in each group, please consult our entities page.
The following example enables all entities covered by GDPR:
Request Body
NAME and CONDITION are redacted, while ORGANIZATION mentions are not, since they are not part of GDPR.
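As a rough sketch of the selector requests described in this section, the following Python builds three request bodies: one with an ENABLE selector, one with a DISABLE selector and one with a preset group. The entity_detection object and the selector names come from this guide; the field names text, entity_types, type and value, the helper selector_request and the sample sentence are assumptions made for illustration and should be checked against the API reference.

```python
import json


def selector_request(text, selector_type, values):
    """Build a process/text-style request body with a single entity selector.

    The field names (text, entity_types, type, value) are assumptions,
    not confirmed by this guide.
    """
    if selector_type not in ("ENABLE", "DISABLE"):
        raise ValueError(f"unknown selector type: {selector_type}")
    return {
        "text": [text],
        "entity_detection": {
            "entity_types": [{"type": selector_type, "value": list(values)}]
        },
    }


# Made-up sample sentence, used only to fill the request.
sample = "Dr. Smith from Acme Corp treated the patient for asthma."

# Detect only NAME and ORGANIZATION.
enable_body = selector_request(sample, "ENABLE", ["NAME", "ORGANIZATION"])
# Detect everything except ORGANIZATION (the disabled type is illustrative).
disable_body = selector_request(sample, "DISABLE", ["ORGANIZATION"])
# Detect the entities in the GDPR preset group.
gdpr_body = selector_request(sample, "ENABLE", ["GDPR"])

print(json.dumps(gdpr_body, indent=2))
```

Group names and individual entity types are passed the same way here, on the assumption that a selector accepts both.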
Security Considerations
There are several reasons that may motivate the use of selectors to limit the set of entities:
- You are working with data from a specific domain that may not contain some of the supported entities. For example, medical data is unlikely to contain PCI information such as credit card numbers. Disabling these entities will prevent potential false positives.
- Some entities, while present in your data, may not be regarded as sensitive in your use case. For example, your data may contain generic URLs and filenames that can’t be used to identify individuals.
Advanced Topics
This section presents advanced techniques to help you get the most out of the Limina service.
Combining Selectors
It is possible to combine selectors to create the desired subset of entities. For example, you may need to comply with both GDPR and HIPAA. This can be done by listing these two groups in an ENABLE selector.
Request Body
Request Body
ORGANIZATION.
It is also possible to combine ENABLE and DISABLE selectors.
Request Body
DATE and DATE_INTERVAL.
Selector Precedence
When combining several selectors, the list of enabled entities is first computed by expanding all entity types and groups in the ENABLE selectors. The entity types and groups listed in DISABLE selectors are then expanded and removed from that list to form the final list of entities.
When no ENABLE selectors are specified, all supported entities are assumed to be enabled. In this case, DISABLE selectors remove entities from the list of all supported entities.
Selective redaction
If you want to redact only one or two entities out of the 50+ entity types that we support, you can use the ENABLE selector. For example, to redact only NAME_FAMILY and LOCATION_ADDRESS_STREET in the text below, we use an ENABLE selector and specify these two entities:
Request Body
This will leave NAME_GIVEN, LOCATION_CITY, PHONE_NUMBER, EMAIL_ADDRESS and ORGANIZATION visible and redact only NAME_FAMILY and LOCATION_ADDRESS_STREET.
In other cases, you may only want to redact PCI but keep all other entities unredacted, specifically ACCOUNT_NUMBER in the following example:
Request Body
BANK_ACCOUNT, CREDIT_CARD, CREDIT_CARD_EXPIRATION and CVV will be redacted, while ACCOUNT_NUMBER and LOCATION_ADDRESS will be visible.
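The selector precedence rule described in this section can be sketched as plain set operations. This is a minimal illustration, not the service's implementation: the entity catalogue is a small made-up subset, the PCI group contents are inferred from the example above, and the selector dictionaries use assumed field names (type, value).

```python
# Hypothetical catalogue: a small subset of entity types and one toy group.
ALL_ENTITIES = {
    "NAME_FAMILY", "NAME_GIVEN", "LOCATION_ADDRESS", "ACCOUNT_NUMBER",
    "BANK_ACCOUNT", "CREDIT_CARD", "CREDIT_CARD_EXPIRATION", "CVV",
    "DATE", "DATE_INTERVAL",
}
GROUPS = {
    "PCI": {"BANK_ACCOUNT", "CREDIT_CARD", "CREDIT_CARD_EXPIRATION", "CVV"},
}


def expand(values):
    """Expand group names into their member entity types."""
    expanded = set()
    for value in values:
        expanded |= GROUPS.get(value, {value})
    return expanded


def resolve(selectors):
    """Return the final set of enabled entities for a list of selectors."""
    enable = [s for s in selectors if s["type"] == "ENABLE"]
    disable = [s for s in selectors if s["type"] == "DISABLE"]

    if enable:
        # Step 1: union of everything listed in ENABLE selectors.
        enabled = set()
        for selector in enable:
            enabled |= expand(selector["value"])
    else:
        # No ENABLE selector: start from all supported entities.
        enabled = set(ALL_ENTITIES)

    # Step 2: remove everything listed in DISABLE selectors.
    for selector in disable:
        enabled -= expand(selector["value"])
    return enabled


pci_only = resolve([{"type": "ENABLE", "value": ["PCI"]}])
no_dates = resolve([{"type": "DISABLE", "value": ["DATE", "DATE_INTERVAL"]}])
print(sorted(pci_only))
```

Because DISABLE selectors are applied after all ENABLE selectors have been expanded, the order in which selectors appear in the list does not change the result in this sketch.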
Understanding Multi-Label Predictions
At the core of the Limina service lies a model that performs named entity recognition (NER). In essence, NER models seek to classify each word in a text into a fixed set of classes: the entity types. NER is often framed as a multi-class, multi-label problem since a single word can have several labels. For example, in Simon Fraser University the word Simon is part of both a name (i.e. Simon Fraser) and an organization (i.e. Simon Fraser University). To create the de-identified or redacted output, the service must select, among the predicted labels, the one that best represents the entity. To do so, it will often prefer longer entities over shorter ones. For example, the service will redact Simon Fraser University as an ORGANIZATION:
Example
Example
ORGANIZATION has been disabled when redacting the above text.
Request Body
Although ORGANIZATION was disabled, part of Simon Fraser University was redacted. This is because Simon Fraser is part of both an organization name and a person name. Given that ORGANIZATION was disabled, the de-identification service picked the second-best label for these words, which is NAME.
Depending on your use case, you may prefer to keep the full organization name in the output. This can be done with the enable_non_max_suppression flag.
Request Body
When the enable_non_max_suppression flag is set to true, the service will ignore labels with lower likelihoods (i.e. NAME in the above example), therefore preventing the redaction of the organization name as shown below.
Use enable_non_max_suppression cautiously. Setting this flag to true may increase the chance of leaking sensitive information.
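The effect of the flag can be illustrated with a toy label-selection function. This is only a conceptual sketch under stated assumptions: the function pick_label, the candidate scores and the way labels are ranked are all hypothetical, and the actual service may weigh entity length and other signals differently.

```python
def pick_label(candidates, enabled, non_max_suppression=False):
    """Choose the label used to redact a span, or None to leave it visible.

    candidates: dict mapping each predicted label to a hypothetical score.
    enabled: set of entity types that are currently enabled.
    """
    ranked = sorted(candidates, key=candidates.get, reverse=True)
    if non_max_suppression:
        # Keep only the top-scoring label; lower-scoring labels are ignored.
        ranked = ranked[:1]
    for label in ranked:
        if label in enabled:
            return label
    return None


# Hypothetical scores for the words "Simon Fraser".
simon_fraser = {"ORGANIZATION": 0.9, "NAME": 0.6}
enabled = {"NAME"}  # ORGANIZATION has been disabled

# Flag off: falls back to the second-best label, so the words are redacted.
fallback = pick_label(simon_fraser, enabled)
# Flag on: only ORGANIZATION is considered, and it is disabled.
suppressed = pick_label(simon_fraser, enabled, non_max_suppression=True)
print(fallback, suppressed)
```

The second call also shows why the flag needs care: a span whose best label is disabled is left untouched even when another enabled label would have caught it.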
Understanding Hierarchical Types
Some of the supported entities in the de-identification service are structured into hierarchies. The labels NAME and LOCATION are good examples.
The NAME hierarchy includes NAME_GIVEN and NAME_FAMILY but also NAME_MEDICAL_PROFESSIONAL, while the LOCATION hierarchy contains LOCATION_COUNTRY, LOCATION_STATE, LOCATION_CITY and so on. Entities forming hierarchies are easily identifiable as they share the same prefix (i.e. NAME or LOCATION in the above examples) followed by an underscore _.
When creating the redacted text, the de-identification service will prefer to use the most specific label in a hierarchy instead of the root label. For example, I live in Canada will be redacted as I live in [LOCATION_COUNTRY] instead of the more generic I live in [LOCATION]. This behaviour improves the usability of the data. If you are not interested in this level of granularity, you can leverage the fact that labels in a hierarchy share the same prefix. This can be done as a post-processing step where tokens like NAME_GIVEN and NAME_FAMILY are replaced with the root label NAME.
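The post-processing step described above could look like the following sketch. It assumes that redaction markers appear in the bracketed form shown in this section, such as [LOCATION_COUNTRY]; the helper name collapse_to_root and the list of root labels are illustrative, and markers in any other format would need an adjusted pattern.

```python
import re

# Root labels whose sub-types should be collapsed (taken from the examples above).
ROOT_LABELS = ("NAME", "LOCATION")

# Matches markers such as [NAME_GIVEN] or [LOCATION_COUNTRY], capturing the root.
_MARKER = re.compile(r"\[(" + "|".join(ROOT_LABELS) + r")_[A-Z_]+\]")


def collapse_to_root(redacted_text):
    """Replace specific hierarchical markers with their root label."""
    return _MARKER.sub(lambda m: f"[{m.group(1)}]", redacted_text)


print(collapse_to_root("I live in [LOCATION_COUNTRY]"))
# prints: I live in [LOCATION]
```

Markers that are already root labels, or that belong to other entity types, are left unchanged.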
Filters
This guide assumes that you have a working knowledge of regular expressions. Limina uses the Python regular expression syntax. You can find more details in the Python re module documentation.
An example Python script showing how to use filters with Limina’s Python client can be found here.
Allow Filter
How can you redact regular phone numbers while keeping companies’ toll-free numbers in the clear? How would you prevent document ID numbers from being detected as sensitive numbers? These are two examples of use cases that can be addressed with Allow filters. Allow filters instruct the detection engine to ignore entities when the entity text matches a specific pattern. To create an Allow filter, a regular expression pattern is first created. This pattern is then added to the filter list in the entity_detection object of your REST request. Let’s look at a couple of examples.
Allow List
It is possible to feed lists of terms as well as regex patterns to filters. For example, if you want to prevent the detection engine from removing country names, you can whitelist them with:
Request Body Entity Detection
Allowing toll-free numbers
This is an example of a process/text request containing an Allow filter. When run, this request will detect and redact phone numbers unless they follow the specific format for toll-free numbers:
Request Body
Escaping Regular Expressions
Many regular expressions contain backslashes \ and other special characters. It is important to note that the backslash \ is a reserved character in JSON. As such, backslashes in strings must be escaped as \\ to retain their original meaning. As demonstrated in the example above, the regular expression r"(1-)?(800|888|877|866|855|844|833)-\d{3}-\d{4}$" was escaped to "(1-)?(800|888|877|866|855|844|833)-\\d{3}-\\d{4}$" in the JSON request body.
Allowing IDs
Let’s look at a different example. Suppose that you are de-identifying contracts of the form:
Example Text
It may not be obvious whether the ID CCT-2022-09-12321 in the document header is sensitive. The sensitivity may depend, for example, on other information being publicly available. In this case, the detection engine will flag the IDs. However, if you know that these numbers are not sensitive, you may prefer to instruct the detection engine to allow such entities:
Request Body
process/text response to the above request:
Normally, CCT-2022-09-12321 would be redacted as NUMERICAL_PII, but as expected it was not redacted thanks to the Allow filter.
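The escaping rule and the two Allow filter use cases in this section can be tried out locally with a short sketch. The toll-free pattern is the one quoted in this guide; the contract ID pattern, the filter field names type and pattern, the helper is_allowed and the use of re.search for matching are assumptions made for illustration. Letting json.dumps serialize the request body takes care of the backslash escaping.

```python
import json
import re

# Pattern quoted in this guide for toll-free numbers.
TOLL_FREE = r"(1-)?(800|888|877|866|855|844|833)-\d{3}-\d{4}$"
# Hypothetical pattern for contract IDs such as CCT-2022-09-12321.
CONTRACT_ID = r"CCT-\d{4}-\d{2}-\d{5}"

# Assumed filter shape: a list of {"type", "pattern"} objects in entity_detection.
entity_detection = {
    "filter": [
        {"type": "ALLOW", "pattern": TOLL_FREE},
        {"type": "ALLOW", "pattern": CONTRACT_ID},
    ]
}

# json.dumps doubles every backslash, so \d is written as \\d in the body.
body = json.dumps({"entity_detection": entity_detection})
print(body)


def is_allowed(entity_text, filters):
    """Local approximation: True if any ALLOW pattern matches the entity text."""
    return any(
        f["type"] == "ALLOW" and re.search(f["pattern"], entity_text)
        for f in filters
    )


filters = entity_detection["filter"]
toll_free_kept = is_allowed("1-800-555-0199", filters)
regular_kept = is_allowed("604-555-0199", filters)
contract_kept = is_allowed("CCT-2022-09-12321", filters)
```

Testing patterns locally this way helps catch escaping mistakes before a request is sent, although the service's own matching rules may differ from re.search.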
Block Filter
Let’s say that you want to detect some codes or IDs sharing a common format in your data. You can rely on the de-identification service to perform the redaction for you, but it may sometimes be preferable to create your own detection logic and provide a specific label for these entities. This is exactly what block filters are for.
Block List
Similar to the allow list or whitelist, you can create a block list or blacklist to ensure that some common keywords are always detected and removed like so:
Request Body Entity Detection
Blocking IDs
Let’s look at our contract example above. With the help of block filters, you can redact the contract ID as CONTRACT_ID in the document above:
Request Body
process/text response to the above request:
Augmenting an existing entity type
In this example, we are detecting ICD-10 codes and adding these entities to the existing CONDITION entity type:
Request Body
process/text response to the above request:
CONDITION entity type.
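The two block filter use cases above can be approximated locally as follows. This is a sketch under stated assumptions rather than the service's behaviour: the field names type, entity_type and pattern are assumed, both regular expressions are hypothetical (the ICD-10 pattern is deliberately simplified), and the real service combines block filters with its own model detections instead of doing a plain substitution.

```python
import re

# Assumed filter shape; both patterns are illustrative only.
block_filters = [
    # New, custom label for contract IDs such as CCT-2022-09-12321.
    {"type": "BLOCK", "entity_type": "CONTRACT_ID", "pattern": r"CCT-\d{4}-\d{2}-\d{5}"},
    # Simplified ICD-10 shape (e.g. J45.40), added to the existing CONDITION type.
    {"type": "BLOCK", "entity_type": "CONDITION", "pattern": r"\b[A-Z]\d{2}\.\d{1,2}\b"},
]


def apply_block_filters(text, filters):
    """Replace every match of each BLOCK pattern with its entity label."""
    for f in filters:
        text = re.sub(f["pattern"], f"[{f['entity_type']}]", text)
    return text


sample = "Contract CCT-2022-09-12321 covers diagnosis J45.40."
print(apply_block_filters(sample, block_filters))
# prints: Contract [CONTRACT_ID] covers diagnosis [CONDITION].
```

Running patterns against sample text like this is a quick way to check that a block filter neither misses IDs nor matches unrelated text.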
Allow Text Filter (new in 3.7)
Allow text filters are similar to Allow filters but instead of allowing individual entities, they “mark” sections of your document as safe so that no entities are detected and nothing is redacted or de-identified. Let’s consider a simple example.
Allowing a section of a document
Suppose that you have a document which contains a References section with public information only:
Example Text
Example Processed Text
Request Body
Example Processed Text
Using capturing groups
Capturing groups are a very useful feature of regular expressions. By adding capturing groups to your regular expression, you can effectively dissect a matched text into the sections of interest. Consider this document including an audit trail with the editor name and the date of the changes:
Example Text
Request Body
Note the capturing group ([^\]]*) in the second part of the pattern. This group selects the date, that is, the section of text from the colon : up to the closing square bracket ]. This informs the Allow Text filter that only this section has to be allowed. This produces the following processed text:
Example Processed Text
Quick summary
Limina uses the re Python package to evaluate regular expressions.
As we have seen so far, the difference between ALLOW and ALLOW_TEXT can be summarized as follows:
- ALLOW checks whether a regex pattern matches a detected entity value. If the entity text matches the pattern, the entity is ignored and not redacted.
- ALLOW_TEXT checks whether detected entities are part of any text fragment that the provided regex pattern matches in the document. If the entity text is part of a text fragment that matches the pattern, the entity is ignored and not redacted.
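The ALLOW_TEXT behaviour, including the capturing-group case, can be sketched as a span containment check. This is an illustration only: the sample audit-trail line, the helper names and the use of re.finditer are assumptions, and the entity spans are computed by hand rather than returned by the service.

```python
import re


def allowed_spans(pattern, text):
    """Return the (start, end) spans that an ALLOW_TEXT-style pattern marks as safe.

    If the pattern has capturing groups, only the groups are allowed;
    otherwise the whole match is allowed.
    """
    spans = []
    for m in re.finditer(pattern, text):
        if m.re.groups:
            for i in range(1, m.re.groups + 1):
                if m.span(i) != (-1, -1):  # skip groups that did not participate
                    spans.append(m.span(i))
        else:
            spans.append(m.span())
    return spans


def keep_entity(entity_span, spans):
    """True if the entity lies entirely inside one allowed span."""
    start, end = entity_span
    return any(s <= start and end <= e for s, e in spans)


def span_of(fragment, text):
    start = text.index(fragment)
    return (start, start + len(fragment))


# Made-up audit-trail line with an editor name and a date.
text = "[Edited by Jane Doe: 2022-09-12] Contract renewed."
name_span = span_of("Jane Doe", text)
date_span = span_of("2022-09-12", text)

# With a capturing group, only the text after the colon is allowed.
grouped = allowed_spans(r"\[Edited by [^:]*:([^\]]*)\]", text)
# Without a group, the whole bracketed fragment is allowed.
ungrouped = allowed_spans(r"\[Edited by [^\]]*\]", text)

print(keep_entity(name_span, grouped), keep_entity(date_span, grouped))
# prints: False True
```

With the grouped pattern the date stays visible while the editor name remains eligible for redaction; with the ungrouped pattern both fall inside the allowed fragment.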