- Part 1: Configuring a mask covers the basics of setting a mask to meet your preferences.
- Part 2: Configuring a marker covers the basics of setting the marker format to meet your preferences.
- Part 3: Using synthetic PII explains how to replace the original PII with synthetic values.
- Part 4: Custom redaction using the NER Text route presents an approach to create a fully customized redacted output using the NER route.
Replacements are configured through the processed_text object, part of the API's request. See the specific route documentation for details. The following description uses example requests and responses from the Process Text route for simplicity.
When redacting or de-identifying text, a customizable string pattern is used to replace the detected PII in the text. Limina supports these replacement options:
- MASK: a string of repeated characters up to the length of the replaced entity. This option produces a redacted text containing no information about the actual entities that were replaced.
- MARKER: a string containing the type of the entity being replaced. Markers can also be configured to link different mentions of the same entity in the redacted text (i.e., a name that appears twice in the text will receive the same unique replacement marker).
- SYNTHETIC: AI-generated text that replaces the original entity. This option produces a processed text that is very similar to the original input, except that sensitive PII has been replaced with fake values.
Configuring a mask
Masking is also known as hashing when the # character is used. Setting the mask option is as simple as setting the type to MASK in the processed_text object.
Request Body
The mask character can be customized, for example by setting it to #. The redacted text will then look like:
Request Body
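As a sketch of what such a request body might look like in Python (the endpoint, sample text, and the mask_character field name are assumptions based on the description above, not verified API details):

```python
# Sketch of a Process Text request body using MASK redaction.
# Field names below mirror the prose description; treat them as assumptions.
request_body = {
    "text": ["Hi, my name is Nessa Jonsson."],
    "processed_text": {
        "type": "MASK",
        "mask_character": "#",  # assumed field name for the custom mask character
    },
}

# Sending it might look like this (requires an API key):
# import requests
# resp = requests.post("https://<your-host>/process/text", json=request_body)
```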
Configuring a marker
The marker option allows the redacted text to include the entity type, which may improve the readability of the redacted text. Setting the default marker option is as simple as setting the mask option.
Request Body
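A minimal sketch of the marker request in Python (sample text and field layout are assumptions based on the surrounding description):

```python
# Sketch of a Process Text request body using the default MARKER replacement.
request_body = {
    "text": ["Hi, my name is Nessa Jonsson and I work at Icarus."],
    "processed_text": {
        "type": "MARKER",  # default marker format applies when no pattern is given
    },
}
```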
Nessa Jonsson and the spelled-out mention N-E-S-S-A J-O-N-S-S-O-N have been replaced with the same marker index NAME_1. Similarly, two mentions of Icarus have been replaced with ORGANIZATION_2. When creating the markers with the default settings, the de-identification service will use a unique marker index unless the entity was previously seen in the text. If the entity is repeated more than once in the text, the service will do its best to assign the same unique marker. Read more about keeping the relationship between entities in the Coreference Resolution Section below.
You can also create your own marker by providing a format containing one of the marker keywords below (e.g., [ENTITY_TYPE]):
| Marker keywords | Description |
|---|---|
| ENTITY_TYPE | Replace the entity with the type that best describes it (e.g., John -> NAME_GIVEN) |
| ALL_ENTITY_TYPES | Replace the entity with all the labels that apply (e.g., John -> NAME_GIVEN,NAME) |
| UNIQUE_ENTITY_TYPE (default) | Replace the entity with the type that best describes it and append a number so that different entities have different markers (e.g., John -> NAME_GIVEN_1, Mary -> NAME_GIVEN_2) |
| UNIQUE_HASHED_ENTITY_TYPE | Similar to UNIQUE_ENTITY_TYPE, except that the service appends a hash value instead of a sequential integer to make the markers unique. |
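A custom marker format might be supplied roughly as follows; note that the field name "pattern" is an assumption here, and the keyword must be one of the marker keywords from the table above:

```python
# Sketch of a request with a custom marker format.
# "pattern" is an assumed field name for the marker format string.
request_body = {
    "text": ["John met Mary."],
    "processed_text": {
        "type": "MARKER",
        "pattern": "[UNIQUE_ENTITY_TYPE]",  # marker keyword wrapped in brackets
    },
}
```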
What is coreference resolution?
Coreference resolution is a natural language processing (NLP) task that consists of locating and associating different mentions of real-world entities in unstructured text. Let's look at an example.
Text Example
Limina and coreference resolution (new in 4.0)
Limina's de-identification service offers the ability to use coreference resolution on its process/text and analyze/text endpoints. The current implementation of coreference resolution is done on top of the named entity recognition (NER) model. It is, therefore, limited to returning coreference between entities only.
Limina offers three different methods of performing coreference resolution. Whether you need to feed the redacted text to an ML model or simply want to make it easier to identify the different mentions of entities, you can benefit from Limina's coreference resolution support.
| Method Name | Description | Speed | Limitations |
|---|---|---|---|
| Heuristics | Uses rule-based methods for linking entities based on string matching. | Fast | Mostly links exact matches and a few minor variations (e.g., difference in casing). It may miss more complex variations and typos. |
| Model Prediction | Uses a neural network model to resolve coreferences, allowing for variations. | Slower | Currently only supports NAME and ORGANIZATION entities in English. This method is much slower than the heuristics one. |
| Combined | Combines both heuristics and model prediction for better coverage. | Slowest | Supports all entities with the heuristics method but resolves more complex cases for NAME and ORGANIZATION with the model prediction method. Slower than heuristics. |
For details on setting coreference resolution with the analyze/text endpoint, refer to the analyze-text documentation.
Here is an example of how to set coreference resolution in your de-identification request using the process/text endpoint. Notice the coreference_resolution field part of the processed_text object.
Request Body
The coreference_resolution field can take one of three values: heuristics, model_prediction, or combined. Note that coreference resolution is enabled whenever a unique marker is used (i.e., UNIQUE_ENTITY_TYPE or UNIQUE_HASHED_ENTITY_TYPE). By default, the heuristics mode will be enabled.
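A small helper sketch that builds such a request and guards against invalid mode values (the helper name and field layout are assumptions, not part of the API):

```python
# The three coreference resolution modes described above.
ALLOWED_MODES = {"heuristics", "model_prediction", "combined"}

def make_request_body(text, mode="heuristics"):
    """Build a hypothetical Process Text request body with a chosen mode."""
    if mode not in ALLOWED_MODES:
        raise ValueError(f"unknown coreference_resolution mode: {mode}")
    return {
        "text": [text],
        "processed_text": {
            "type": "MARKER",
            "coreference_resolution": mode,
        },
    }
```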
The following sections describe each of these options.
Heuristics
This method of coreference resolution is based solely on string matching. It is therefore only capable of linking entities that are mentioned in the same way. For example, the entities Mary and mary will be linked together because the strings match, except for a small difference in casing.
However, the entities John A. Smith and Mr Smith will not be linked together in this mode even if they are referring to the same person. As a consequence, the two entities will be assigned different unique markers (e.g., NAME_1 and NAME_2) in the redacted text.
Here is the output of the example above using the heuristics mode:
Notice that the first mention of the organization was not linked with the two other mentions (which were both replaced with the ORGANIZATION_2 marker). The first mention is too different from the two other mentions for it to be resolved using heuristics. However, the two mentions of the person, Nessa Jonsson and the spelled-out mention N-E-S-S-A J-O-N-S-S-O-N, were correctly linked despite the differences between the strings.
While the heuristics mode has its limitations, it is great when a more predictable output is required (e.g., all exact mentions of an entity need to be linked together in a text, no matter how long or difficult the text is). This option is also the fastest, and it supports all entity types, not just NAME and ORGANIZATION.
The heuristics mode is currently the default one in the process/text endpoint.
Model prediction (new in 4.0)
The model_prediction option was introduced by Limina to work around some of the limitations of the heuristics mode of resolution. This option uses a neural network model to resolve coreferences. It is capable of resolving mentions that have different spellings or even mentions containing typos.
The request is the same as before, except that the value of the coreference_resolution field has changed. As you can see, the model_prediction mode is capable of linking Icarus Airways Customer Service with Icarus, leading to a redacted text in which all mentions of the Icarus organization are replaced with the same marker (i.e., ORGANIZATION_1). However, the model was unable to link the person's name Nessa Jonsson with its spelled-out form.
The model_prediction option is great if you are dealing with entity mentions that may contain variations or typos. It currently only resolves NAME and ORGANIZATION entities in English text. Note also that this option is much slower than the heuristics one; it is not recommended for text samples that contain more than a few hundred words.
Combined (new in 4.0)
As its name suggests, this option combines the two other coreference resolution modes. The combined mode is able to resolve all coreferential mentions in the example above. If you plan to process only English text and want the best coverage for identifying and resolving coreferential mentions of all types, then this mode is for you!
Note that the combined mode suffers from the same limitations as the model_prediction mode. It is much slower than the heuristics mode alone and is not recommended when processing large volumes of text (e.g., several thousand words).
:::note coreference resolution across multiple requests
Note that coreference resolution is only performed within a single request. Identical entities across different requests will usually be assigned different markers. If you are processing related text fragments, consider passing them as a batch in a single request and setting link_batch to True. This allows the de-identification service to link entities across these fragments.
:::
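A sketch of such a batched request (the placement of the link_batch field at the top level is an assumption based on the note above):

```python
# Sketch of a batched request linking entities across related fragments.
# link_batch is assumed to be a top-level boolean field.
request_body = {
    "text": [
        "Nessa Jonsson called customer service.",
        "The agent confirmed Nessa Jonsson's booking.",
    ],
    "link_batch": True,  # link entity mentions across the two fragments
    "processed_text": {"type": "MARKER"},
}
```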
Synthetic PII
You may choose to replace the entities in your text with fake or synthetic entities instead of markers and masks. There are a few reasons to do so. For example, if you train an AI model on your data, synthetic replacements might provide more realistic training input. Generating synthetic PII is done by setting processed_text.type to SYNTHETIC.
Request Body
The accuracy of the synthetic generation is controlled by the synthetic_entity_accuracy field. For English generation, set this parameter to standard for best results. For other languages, set it to standard_multilingual and the synthetic model will attempt to predict entities matching the input text language. The default accuracy is standard_automatic, which will determine the appropriate model (i.e., standard or standard_multilingual) from the input language.
Request Body
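A sketch of a synthetic-PII request body in Python (the sample text is an assumption; the accuracy values are the ones described above):

```python
# Sketch of a Process Text request generating synthetic PII.
request_body = {
    "text": ["Hi, my name is Nessa Jonsson."],
    "processed_text": {
        "type": "SYNTHETIC",
        # One of: "standard", "standard_multilingual", "standard_automatic"
        "synthetic_entity_accuracy": "standard_automatic",
    },
}
```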
Custom redaction using the NER Text route
As we have seen above, the Process Text route offers a lot of flexibility in how text and files are redacted. In the event that you have a specific use case that is not completely covered by the API, it is possible to create your own custom redaction function. This section shows how the NER Text route, introduced in 3.9, can be used to create a custom redaction function with more fine-grained labels.
Process Text route redaction
Let's say that you want to redact this fragment of text:
Text Example
Redacted Text
Notice how the words ERIC G. BADORREK, including the first name, initial, and last name, were combined into a single NAME marker. This grouping of words into a single marker is even more apparent for the address 35933 COLLINS LN, FENWICK WEST, SELBYVILLE, Sussex County, Delaware, U.S.A., which is redacted as a single LOCATION_ADDRESS label. This certainly makes the redacted content more readable, but it hides some information that may be useful for your use case. For example, you might want to know whether the provided address contained a zip code or a country, which is impossible to determine from the current redacted output.
Using the NER Text route to create your own redacted content
Unlike the Process Text route, the NER Text route does not provide a redacted output. However, the entities it returns can be used to create one. Let's see how. Consider this piece of code, which processes the same sample text, this time with the NER Text route.
Full Python Code Example
Python Code Example
A small utility class, NotComparable, is created to ensure that identical strings are not comparable (i.e., NotComparable("e") != NotComparable("e")). This will be useful when outputting the redacted text.
Python Code Example
When consecutive characters map to the same marker, the code relies on the groupby function to only output that marker once. This is where the NotComparable utility plays its role, by preventing consecutive identical characters (e.g., the two Rs in BADORREK) from being grouped together.
Python Code Example
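Putting the pieces together, here is a minimal self-contained sketch of the technique. It assumes each entity is a dict with "start"/"end" character offsets and a "label"; the NER Text route's actual response format may differ.

```python
from itertools import groupby

class NotComparable:
    """Wraps a character so that two instances never compare equal,
    preventing groupby from merging identical plain-text characters."""
    def __init__(self, char):
        self.char = char
    def __eq__(self, other):
        return False
    def __hash__(self):
        return id(self)

def redact(text, entities):
    """Replace each entity span with a [LABEL] marker.

    `entities` is assumed to be a list of dicts with character offsets
    "start"/"end" and a "label" (a simplification of the NER response).
    """
    keys = []
    for i, char in enumerate(text):
        label = next(
            (e["label"] for e in entities if e["start"] <= i < e["end"]),
            None,
        )
        # Entity characters map to their marker string; everything else
        # is wrapped so it is never grouped with its neighbours.
        keys.append(f"[{label}]" if label else NotComparable(char))
    # Consecutive identical marker strings collapse into a single group,
    # so each entity span is emitted as one marker.
    return "".join(
        key if isinstance(key, str) else key.char
        for key, _group in groupby(keys)
    )
```

For example, redact("Call Eric now", [{"start": 5, "end": 9, "label": "NAME_GIVEN"}]) produces "Call [NAME_GIVEN] now": the four characters of the entity all map to the same marker string and are collapsed to one, while the wrapped plain characters (including the double "l" in "Call") pass through untouched.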
Redacted Text
Notice how ERIC G. BADORREK has been replaced with [NAME_GIVEN][NAME][NAME_FAMILY] instead of a single NAME marker, and how the address 35933 COLLINS LN, FENWICK WEST, SELBYVILLE, Sussex County, Delaware, U.S.A. was redacted in much more detail: [LOCATION_ADDRESS_STREET][LOCATION_ADDRESS][LOCATION_CITY][LOCATION_ADDRESS][LOCATION_STATE][LOCATION_ADDRESS][LOCATION_COUNTRY]. From the above redacted text, it becomes clear that the original address contained a city, a state, and a country but no zip code.
A parting note about privacy
You may wonder whether the redacted results achieved in this section could have been obtained more simply by disabling the NAME, LOCATION, and LOCATION_ADDRESS entity types when making the request. While disabling entity types has its uses, the technique described above has the advantage of lowering the chances of leaking sensitive data. Consider, for example, the words Sussex County, which are part of the provided address. These words belong to the LOCATION_ADDRESS entity but not to any other sub-entity. As a result, they would be left unredacted if both LOCATION and LOCATION_ADDRESS were disabled. The same applies to many other entities, like Mount Everest, which is a LOCATION but does not match any other location sub-entity. By disabling the LOCATION label, we would leave these entities unredacted, which might not be desirable for some use cases.