> ## Documentation Index
> Fetch the complete documentation index at: https://docs.getlimina.ai/llms.txt
> Use this file to discover all available pages before exploring further.

# Processing Structured Data Sources

> This guide illustrates how to use Limina on Structured Data sources such as Databases, JSON, CSV and XML files

This guide illustrates how to use Limina on Structured Data sources such as Databases, JSON, CSV and XML files. It is also applicable to Databricks and Snowflake.

<Info>
  [Connect with one of our privacy experts](https://getlimina.ai/contact-us/?utm_source=docs\&utm_medium=website) to run this code.
</Info>

## Tabular Data

<Note>
  If it is only required to redact certain Columns such as `VisitNotes` please see our [PII Safe Sentiment Analysis Walkthrough](https://getlimina.ai/blog/sentiment-analysis-anonymization/).
</Note>

In addition to raw text and files, you can use Limina to process tabular data, such as found in a database, CSV or Microsoft Excel file. For example, consider the below table:

| ClientNm       | DOfB   | HlthCareNm | TempFd | VisitNotes                                                                                                              |
| -------------- | ------ | ---------- | ------ | ----------------------------------------------------------------------------------------------------------------------- |
| Stefan Krugler | 150368 | 2342345234 | NYC    | Prescribed Stefan 10mg of Ibuprofen for his migraines. Will call him on +1 324 4325 5462 when I get the test results in |
| Hari Seldon    | 201030 | 4947384944 | LA     | Patient came in for a regular checkup.                                                                                  |

One could try to create a rule-based system, but it is difficult:

* The column names are all acronyms, `DOfB` in particular is an uncommon way of referring to a date of birth.
* Each cell value is written as an integer to save space, removing the hint of `/` or `.` characters in identifying the date.
* `TempFd` contains the city the individual lives in, but is stored in a temporary field.
* `VisitNotes` is a varchar free text field. This field is particularly useful for ML use cases, for example to determine the main visit reasons or for case resolution.

In particular, longer string and blob fields in databases can be difficult.

### Processing As a String

A simple approach would be to join each row together and process as a single long string:

<CodeGroup>
  ```json Request Body wrap lines theme={"theme":"poimandres"}
  {
    "text": [
      "ClientNm, DOfB, HlthCareNm, TempFd, VisitNotes, Stefan Krugler, 150368, 2342345234, NYC, Prescribed Stefan 10mg of Ibuprofen for his migraines. Will call him on +1 324 4325 5462 when I get the test results in, Hari Seldon, 201030, 4947384944, LA, Patient came in for a regular checkup."
    ]
  }
  ```

  ```shell cURL wrap lines theme={"theme":"poimandres"}
  curl --location 'https://api.private-ai.com/community/v4/process/text' \
  --header 'Content-Type: application/json' \
  --header 'x-api-key: <YOUR KEY HERE>' \
  --data '{
    "text": [
      "ClientNm, DOfB, HlthCareNm, TempFd, VisitNotes, Stefan Krugler, 150368, 2342345234, NYC, Prescribed Stefan 10mg of Ibuprofen for his migraines. Will call him on +1 324 4325 5462 when I get the test results in, Hari Seldon, 201030, 4947384944, LA, Patient came in for a regular checkup."
    ]
  }'
  ```

  ```python Python wrap lines theme={"theme":"poimandres"}
  import requests

  r = requests.post(url="https://api.private-ai.com/community/v4/process/text",
                    headers={"x-api-key": "<YOUR API KEY>"},
                    json={"text": ["ClientNm, DOfB, HlthCareNm, TempFd, VisitNotes, Stefan Krugler, 150368, 2342345234, NYC, Prescribed Stefan 10mg of Ibuprofen for his migraines. Will call him on +1 324 4325 5462 when I get the test results in, Hari Seldon, 201030, 4947384944, LA, Patient came in for a regular checkup."]})

  results = r.json()

  print(results)
  ```

  ```python Python Client wrap lines theme={"theme":"poimandres"}
  from privateai_client import PAIClient
  from privateai_client import request_objects

  client = PAIClient(url="https://api.private-ai.com/community/v4/", api_key="<YOUR API KEY>")

  text_request = request_objects.process_text_obj(text=["ClientNm, DOfB, HlthCareNm, TempFd, VisitNotes, Stefan Krugler, 150368, 2342345234, NYC, Prescribed Stefan 10mg of Ibuprofen for his migraines. Will call him on +1 324 4325 5462 when I get the test results in, Hari Seldon, 201030, 4947384944, LA, Patient came in for a regular checkup."])

  response = client.process_text(text_request)

  print(response.processed_text)
  ```
</CodeGroup>

However the result isn't great:

```shell Response Text wrap theme={"theme":"poimandres"}
['ClientNm, DOfB, HlthCareNm, TempFd, VisitNotes, [NAME_1], [LOCATION_1], Prescribed [NAME_GIVEN_1] [DOSE_1] of [DRUG_1] for his [CONDITION_1]. Will call him on [PHONE_NUMBER_1] when I get the test results in, [NAME_MEDICAL_PROFESSIONAL_1], [LOCATION_2], Patient came in for a regular checkup.']
```

### Formatting the Data Correctly

We can get better results by formatting the input in a `"key1: value1, key2: value2, ...` pattern, to make sure that each cell entry includes the column name for context. If multiple headers exist, the format should look like: `"key1a: key1b: value1, key2: value2, ...`. Note that it's important to put a colon at the end of each key.

<CodeGroup>
  ```json Request Body wrap lines theme={"theme":"poimandres"}
  {
    "text": [
      "ClientNm: Stefan Krugler, DOfB: 150368, HlthCareNm: 2342345234, TempFd: NYC, VisitNotes: Prescribed Stefan 10mg of Ibuprofen for his migraines. Will call him on +1 324 4325 5462 when I get the test results in, ClientNm: Hari Seldon, DOfB: 201030, HlthCareNm: 4947384944, TempFd: LA, VisitNotes: Patient came in for a regular checkup."
    ]
  }
  ```

  ```shell cURL wrap lines theme={"theme":"poimandres"}
  curl --location 'https://api.private-ai.com/community/v4/process/text' \
  --header 'Content-Type: application/json' \
  --header 'x-api-key: <YOUR KEY HERE>' \
  --data '{
    "text": [
      "ClientNm: Stefan Krugler, DOfB: 150368, HlthCareNm: 2342345234, TempFd: NYC, VisitNotes: Prescribed Stefan 10mg of Ibuprofen for his migraines. Will call him on +1 324 4325 5462 when I get the test results in, ClientNm: Hari Seldon, DOfB: 201030, HlthCareNm: 4947384944, TempFd: LA, VisitNotes: Patient came in for a regular checkup."
    ]
  }'
  ```

  ```python Python wrap lines theme={"theme":"poimandres"}
  import requests

  r = requests.post(url="https://api.private-ai.com/community/v4/process/text",
                    headers={"x-api-key": "<YOUR API KEY>"},
                    json={"text": ["ClientNm: Stefan Krugler, DOfB: 150368, HlthCareNm: 2342345234, TempFd: NYC, VisitNotes: Prescribed Stefan 10mg of Ibuprofen for his migraines. Will call him on +1 324 4325 5462 when I get the test results in, ClientNm: Hari Seldon, DOfB: 201030, HlthCareNm: 4947384944, TempFd: LA, VisitNotes: Patient came in for a regular checkup."]})

  results = r.json()

  print(results)
  ```

  ```python Python Client wrap lines theme={"theme":"poimandres"}
  from privateai_client import PAIClient
  from privateai_client import request_objects

  client = PAIClient(url="https://api.private-ai.com/community/v4/", api_key="<YOUR API KEY>")

  text_request = request_objects.process_text_obj(text=[
      "ClientNm: Stefan Krugler, DOfB: 150368, HlthCareNm: 2342345234, TempFd: NYC, VisitNotes: Prescribed Stefan 10mg of Ibuprofen for his migraines. Will call him on +1 324 4325 5462 when I get the test results in, ClientNm: Hari Seldon, DOfB: 201030, HlthCareNm: 4947384944, TempFd: LA, VisitNotes: Patient came in for a regular checkup."])

  response = client.process_text(text_request)

  print(response.processed_text)
  ```
</CodeGroup>

As you can see, the result is much better; it's still susceptible to False-Positive detections on column names. Also as the result is a long string, it is more difficult to relate the predictions back to individual cells:

```shell Response Text wrap theme={"theme":"poimandres"}
['ClientNm: [NAME_1], DOfB: [DOB_1], HlthCareNm: [HEALTHCARE_NUMBER_1], TempFd: [LOCATION_CITY_1], VisitNotes: Prescribed [NAME_GIVEN_1] [DOSE_1] of [DRUG_1] for his [CONDITION_1]. Will call him on [PHONE_NUMBER_1] when I get the test results in, ClientNm: [NAME_2], DOfB: [DOB_2], HlthCareNm: [HEALTHCARE_NUMBER_2], TempFd: [LOCATION_CITY_2], VisitNotes: Patient came in for a regular checkup.']
```

### Best Practice: Correct Formatting With Link Batch

<Info>
  These examples process the entire table in a single API request to ensure consistent entity numbering. For larger tables it is recommended to process row by row, without numbered markers.
</Info>

We can solve both of the issues with the previous approach by feeding the table in as a list with [link\_batch](/container-quickstart/api-quickstart#processing-related-examples) on. This is functionally equivalent to the example presented above and this allows us to easily get the results of each cell without any post-processing logic.

<CodeGroup>
  ```json Request Body wrap lines highlight={5} theme={"theme":"poimandres"}
  {
    "text": [
      "ClientNm:", "Stefan Krugler", ", DOfB:", "150368", ", HlthCareNm:", "2342345234", ", TempFd:", "NYC", ", VisitNotes:", "Prescribed Stefan 10mg of Ibuprofen for his migraines. Will call him on +1 324 4325 5462 when I get the test results in", "ClientNm:", "Hari Seldon", ", DOfB:", "201030", ", HlthCareNm:", "4947384944", ", TempFd:", "LA", ", VisitNotes:", "Patient came in for a regular checkup."
    ],
    "link_batch": true
  }
  ```

  ```shell cURL wrap lines theme={"theme":"poimandres"}
  curl --location 'https://api.private-ai.com/community/v4/process/text' \
  --header 'Content-Type: application/json' \
  --header 'x-api-key: <YOUR KEY HERE>' \
  --data '{
    "text": [
      "ClientNm:", "Stefan Krugler", ", DOfB:", "150368", ", HlthCareNm:", "2342345234", ", TempFd:", "NYC", ", VisitNotes:", "Prescribed Stefan 10mg of Ibuprofen for his migraines. Will call him on +1 324 4325 5462 when I get the test results in", "ClientNm:", "Hari Seldon", ", DOfB:", "201030", ", HlthCareNm:", "4947384944", ", TempFd:", "LA", ", VisitNotes:", "Patient came in for a regular checkup."
    ],
    "link_batch": true
  }'
  ```

  ```python Python wrap lines theme={"theme":"poimandres"}
  import requests

  r = requests.post(url="https://api.private-ai.com/community/v4/process/text",
                    headers={"x-api-key": "<YOUR API KEY>"},
                    json={"text": ["ClientNm:", "Stefan Krugler", ", DOfB:", "150368", ", HlthCareNm:", "2342345234", ", TempFd:", "NYC", ", VisitNotes:", "Prescribed Stefan 10mg of Ibuprofen for his migraines. Will call him on +1 324 4325 5462 when I get the test results in", "ClientNm:", "Hari Seldon", ", DOfB:", "201030", ", HlthCareNm:", "4947384944", ", TempFd:", "LA", ", VisitNotes:", "Patient came in for a regular checkup."], "link_batch": True})

  results = r.json()

  print(results)
  ```

  ```python Python Client wrap lines theme={"theme":"poimandres"}
  from privateai_client import PAIClient
  from privateai_client import request_objects

  client = PAIClient(url="https://api.private-ai.com/community/v4/", api_key="<YOUR API KEY>")

  text = ["ClientNm:", "Stefan Krugler", ", DOfB:", "150368", ", HlthCareNm:", "2342345234", ", TempFd:", "NYC", ", VisitNotes:", "Prescribed Stefan 10mg of Ibuprofen for his migraines. Will call him on +1 324 4325 5462 when I get the test results in", "ClientNm:", "Hari Seldon", ", DOfB:", "201030", ", HlthCareNm:", "4947384944", ", TempFd:", "LA", ", VisitNotes:", "Patient came in for a regular checkup."]

  text_request = request_objects.process_text_obj(text=text, link_batch=True)
  response = client.process_text(text_request)
  print(response.processed_text)
  ```
</CodeGroup>

The result is the same, but it is now easy to get the output for each cell by looking at the corresponding response in the list:

```shell Response Text wrap theme={"theme":"poimandres"}
['ClientNm:', '[NAME_1]', ', DOfB:', '[DOB_1]', ', HlthCareNm:', '[HEALTHCARE_NUMBER_1]', ', TempFd:', '[LOCATION_CITY_1]', ', VisitNotes:', 'Prescribed [NAME_GIVEN_1] [DOSE_1] of [DRUG_1] for his [CONDITION_1]. Will call him on [PHONE_NUMBER_1] when I get the test results in', 'ClientNm:', '[NAME_2]', ', DOfB:', '[DOB_2]', ', HlthCareNm:', '[HEALTHCARE_NUMBER_2]', ', TempFd:', '[LOCATION_CITY_2]', ', VisitNotes:', 'Patient came in for a regular checkup.']
```

In addition to the redacted result, the entity list (not shown) can be used. This approach can be adjusted to column-by-column, but row-by-row usually leads to better detection performance.

The examples above process the entire table but they can easily be adapted to process specific rows and columns.

## JSON & XML

<Info>
  The approach described in this section is implemented in the `process/files` routes for [JSON](/configuration-and-operations/working-with-files/processing-files/json) and [XML](/configuration-and-operations/working-with-files/processing-files/xml).
</Info>

A similar approach applies for JSON and XML. Consider the following JSON:

```json JSON theme={"theme":"poimandres"}
{
  "VisitDates": ["15/2/2023", "8/5/2023"],
  "PhysName": "Dr Walter",
  "PersonAttributes": {"Weight": 78.5, "Height": 189},
  "MedLoc": "Mt Western Cancer Center"
}
```

It is possible to send the JSON through as text, like one might to an LLM, yet this might corrupt the JSON - a lot of pre- and post-processing logic is necessary for a production application. Instead, we can parse the JSON and arrange it into a `<key>: <value> ...` pattern like above, add colons to the last key before each value and enable [link\_batch](/container-quickstart/api-quickstart#processing-related-examples). The only difference is that a value might have multiple keys in the hierarchy, so we prepend all of those. The request payload then looks like:

<CodeGroup>
  ```json Request Body wrap lines theme={"theme":"poimandres"}
  {
    "text": [
      "VisitDates:", "15/2/2023", ", VisitDates:", "8/5/2023", ", PhysName:", "Dr Walter", ", PersonAttributes: Weight:", "78.5", ", PersonAttributes: Height:", "189", ", MedLoc:", "Mt Western Cancer Center"
    ],
    "link_batch": true
  }
  ```

  ```shell cURL wrap lines theme={"theme":"poimandres"}
  curl --location 'https://api.private-ai.com/community/v4/process/text' \
  --header 'Content-Type: application/json' \
  --header 'x-api-key: <YOUR KEY HERE>' \
  --data '{
    "text": [
      "VisitDates:", "15/2/2023", ", VisitDates:", "8/5/2023", ", PhysName:", "Dr Walter", ", PersonAttributes: Weight:", "78.5", ", PersonAttributes: Height:", "189", ", MedLoc:", "Mt Western Cancer Center"
    ]
  }'
  ```

  ```python Python wrap lines theme={"theme":"poimandres"}
  import requests

  r = requests.post(url="https://api.private-ai.com/community/v4/process/text",
                    headers={"x-api-key": "<YOUR API KEY>"},
                    json={"text": ["VisitDates:", "15/2/2023", ", VisitDates:", "8/5/2023", ", PhysName:", "Dr Walter", ", PersonAttributes: Weight:", "78.5", ", PersonAttributes: Height:", "189", ", MedLoc:", "Mt Western Cancer Center"]})

  results = r.json()

  print(results)
  ```

  ```python Python Client wrap lines theme={"theme":"poimandres"}
  from privateai_client import PAIClient
  from privateai_client import request_objects

  client = PAIClient(url="https://api.private-ai.com/community/v4/", api_key="<YOUR API KEY>")

  text_request = request_objects.process_text_obj(text=["VisitDates:", "15/2/2023", ", VisitDates:", "8/5/2023", ", PhysName:", "Dr Walter", ", PersonAttributes: Weight:", "78.5", ", PersonAttributes: Height:", "189", ", MedLoc:", "Mt Western Cancer Center"], link_batch=True)

  response = client.process_text(text_request)

  print(response.processed_text)
  ```
</CodeGroup>

Yields the following:

```shell Response Text wrap theme={"theme":"poimandres"}
['VisitDates:', '[DATE_1]', ', VisitDates:', '[DATE_2]', ', PhysName:', '[NAME_MEDICAL_PROFESSIONAL_1]', ', PersonAttributes: Weight:', '[PHYSICAL_ATTRIBUTE_1]', 'PersonAttributes: Height:', '[PHYSICAL_ATTRIBUTE_2]', ', MedLoc:', '[ORGANIZATION_MEDICAL_FACILITY_1]']
```
