
In the example below we describe the implementation for the field email_coming_from with the tag options no_reply and info, so that each tag option matches emails sent from its own sender address.


Code Block
  "key_value_pairs": {
    "rule_config": {
      "email_coming_from": {
        "no_reply": {
          "rules": [
            {
              "confidence": 97,
              "+rule": [""]
            }
          ]
        },
        "info": {
          "rules": [
            {
              "confidence": 97,
              "+rule": [""]
            }
          ]
        }
      }
    }
  }

Per tag option (the level of no_reply and info) you can specify a list of rules, each with its own confidence. The matching rule with the highest confidence is presented in the prediction. If the pattern of the no_reply rule is found anywhere in the uploaded document, a prediction with confidence 97 for the tag option no_reply is returned.
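The selection of the highest-confidence matching rule can be sketched in Python as follows. This is only an illustration: the function name best_match and the two patterns are hypothetical placeholders, and treating a "+rule" list as "any pattern may match" is an assumption, not confirmed behaviour of the rule engine.

```python
import re

# Hypothetical rules for one tag option; the patterns are placeholders,
# not the ones used in the configuration above.
RULES = [
    {"confidence": 97, "+rule": [r"no-?reply@"]},
    {"confidence": 80, "+rule": [r"do not reply"]},
]


def best_match(rules, text):
    """Return the highest confidence among the rules that match, or None."""
    matched = [
        rule["confidence"]
        for rule in rules
        # Assumption: a rule matches when any of its patterns is found.
        if any(re.search(p, text, re.IGNORECASE) for p in rule["+rule"])
    ]
    return max(matched) if matched else None
```

When both rules match the same document, only the confidence of the stronger rule (97 here) is reported.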


The where_to_search option contains 4 settings, all of which are optional:

  1. preprocess_text: This is a boolean specifying whether the text needs to be cleaned (= removal of noise such as unexpected characters, e.g. multiple dashes).

  2. search_in (for email): This field specifies which parts of the email we look in. There are 5 options, which can be combined. When left empty, we look in all of them:

    1. email_from

    2. email_to

    3. email_subject

    4. email_body

    5. attachment

  3. limits: This field specifies which limits we apply to our search_space (see the limits options below).

  4. granularity: This field specifies in which blocks of text we want to search.
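To give an idea of what the cleaning step behind preprocess_text might do, here is a minimal Python sketch. The function name clean_text and the exact set of characters treated as noise are assumptions; the source only mentions multiple dashes as an example.

```python
import re


def clean_text(text):
    """Remove simple noise: runs of separator characters and repeated spaces."""
    # Assumption: runs of two or more of these characters count as noise.
    text = re.sub(r"[-_=*]{2,}", " ", text)
    # Collapse the spaces and tabs left behind into a single space.
    text = re.sub(r"[ \t]+", " ", text)
    return text.strip()
```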


In the example given above (which of the two addresses is the sender?), our rules are applied to the entire document. This is too broad for this use case. Instead, we would like to limit our search to the email field that specifies the sender, the email_from field. We can do this by adding the following:

Code Block
  "confidence": 97,
  "+rule": [""],
  "where_to_search": {"search_in": ["email_from"]}

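The effect of search_in can be pictured with the following Python sketch. The helper build_search_space and the representation of an email as a plain dictionary are assumptions made for illustration; only the five part names come from the list above.

```python
EMAIL_PARTS = ("email_from", "email_to", "email_subject", "email_body", "attachment")


def build_search_space(email, search_in=None):
    """Return only the email parts that should be searched."""
    # An empty or missing search_in means: look in all parts.
    parts = search_in or EMAIL_PARTS
    unknown = set(parts) - set(EMAIL_PARTS)
    if unknown:
        raise ValueError(f"unknown search_in options: {sorted(unknown)}")
    return {part: email.get(part, "") for part in parts}
```

With search_in set to ["email_from"], a pattern that only occurs in the body can no longer trigger the rule.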

These are the 5 options to limits:

  1. document_types: list of document types, these can be combined.

  2. pages: list of slices to specify which pages you want to search in.

  3. lines: list of slices to specify which lines you want to search in.

  4. characters: list of slices to specify which characters you want to search in.

  5. email_chains: dictionary of an integer as key (specifying what chain to look in) and a list of slices as value.

For options 2, 3, 4 and 5 we use the notation of a Python slice. Each slice requires a start and a stop value. Start and stop can be integers, which define an index, or floats, which define a fraction of the text (for example 0.2 for 20%). Start and stop need to have the same type.


Code Block
  "confidence": 97,
  "+rule": [""],
  "where_to_search": {
    "search_in": ["email_from"],
    "limits": {
      "characters": [[0,10], [-10]]
    }
  }


Code Block
  "confidence": 97,
  "+rule": [""],
  "where_to_search": {
    "search_in": ["email_from"],
    "limits": {
      "characters": [[0., 0.2]]
    }
  }

If the email_from is the letters of the alphabet, “abcdefghijklmnopqrstuvwxyz”, we would now search in “abcdef”.
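A minimal Python sketch of how integer and float slices could be resolved against a text. The helper name apply_slice is hypothetical, and rounding the float stop upward is an assumption chosen only because it reproduces the “abcdef” result above; the actual implementation may round differently.

```python
import math

ALPHABET = "abcdefghijklmnopqrstuvwxyz"


def apply_slice(text, start, stop):
    """Cut text[start:stop], where start/stop are both ints or both floats."""
    if type(start) is not type(stop):
        raise TypeError("start and stop need to have the same type")
    if isinstance(start, float):
        # Floats are fractions of the text length (assumed rounding).
        start = math.floor(start * len(text))
        stop = math.ceil(stop * len(text))
    return text[start:stop]


first_fifth = apply_slice(ALPHABET, 0.0, 0.2)   # fraction-based slice
first_ten = apply_slice(ALPHABET, 0, 10)        # index-based slice
```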

Example 3: Imagine that in our example we have a chain of emails sent back and forth, with the first email being a forwarded one.

View file: Re_ Don't reply to this email.eml

If we don’t specify any limits on our email chains, the rule looks in all emails of all chains and can match a sender anywhere in them. However, we only care who sent the first email. So we specify a limit on the email chains, which restricts the email search space to the first email in the first chain of emails.

Code Block
  "confidence": 97,
  "+rule": [""],
  "where_to_search": {
    "search_in": ["email_from"],
    "limits": {
      "email_chains": {"0": [[0, 1]]}
    }
  }

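The email_chains limit can be read as "chain index, then slices within that chain". The Python sketch below illustrates this reading; the function limit_email_chains and the representation of chains as a list of lists of emails are assumptions, not the actual data model.

```python
def limit_email_chains(chains, email_chains):
    """Select the emails allowed by an email_chains limit."""
    selected = []
    for chain_key, slices in email_chains.items():
        # JSON object keys are strings, so "0" is converted to the index 0.
        chain = chains[int(chain_key)]
        for start, stop in slices:
            selected.extend(chain[start:stop])
    return selected
```

With the limit {"0": [[0, 1]]}, only the first email of the first chain remains in the search space.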

The granularity allows you to specify in which blocks of text we want to search.

The options are:

  1. full (default if nothing is specified)

  2. page

  3. sentence

  4. paragraph

  5. line
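How a text could be cut into blocks for each granularity is sketched below in Python. The function split_by_granularity and all splitting conventions (form feed between pages, blank line between paragraphs, a simple punctuation rule for sentences) are assumptions for illustration; the real splitting logic is not described here.

```python
import re


def split_by_granularity(text, granularity="full"):
    """Split text into the blocks in which a rule would be searched."""
    if granularity == "full":
        return [text]
    if granularity == "page":
        blocks = text.split("\f")               # assumed page separator
    elif granularity == "paragraph":
        blocks = re.split(r"\n\s*\n", text)     # blank line between paragraphs
    elif granularity == "line":
        blocks = text.splitlines()
    elif granularity == "sentence":
        blocks = re.split(r"(?<=[.!?])\s+", text)  # naive sentence boundary
    else:
        raise ValueError(f"unknown granularity: {granularity}")
    return [block.strip() for block in blocks if block.strip()]
```

A rule is then evaluated per block, so a smaller granularity prevents patterns from matching across unrelated parts of the text.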

Logical additions to “+rule”


Let’s say we want to match the emails coming from a given sender if and only if we don’t see a phone number in the body. Our rule would now look like this:

Code Block
  "confidence": 97,
  "+and": [
    {
      "+rule": [""],
      "where_to_search": {
        "search_in": ["email_from"],
        "limits": [[0,10], [-20]]
      }
    },
    {
      "-rule": ["L:\\+32\\d{9}"],
      "where_to_search": {
        "search_in": ["email_body"]
      }
    }
  ]

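The logic of combining a "+rule" and a "-rule" under "+and" can be sketched in Python as below. This is a simplified reading: the function evaluate_and is hypothetical, the sender pattern is a placeholder, limits are ignored, and prefixes such as "L:" are left out because their handling is not described here.

```python
import re


def evaluate_and(conditions, email):
    """True when every +rule is found and no -rule is found."""
    for condition in conditions:
        parts = condition["where_to_search"]["search_in"]
        text = "\n".join(email.get(part, "") for part in parts)
        # A +rule must be present in its search space.
        if "+rule" in condition and not any(
            re.search(p, text) for p in condition["+rule"]
        ):
            return False
        # A -rule must be absent from its search space.
        if "-rule" in condition and any(
            re.search(p, text) for p in condition["-rule"]
        ):
            return False
    return True


CONDITIONS = [
    {"+rule": [r"no-reply@"], "where_to_search": {"search_in": ["email_from"]}},
    {"-rule": [r"\+32\d{9}"], "where_to_search": {"search_in": ["email_body"]}},
]
```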

Code Block
  "no_reply": {
    "variables": {
      "var1": ["L:\\+32\\d{9}"]
    },
    "rules": [
      {
        "confidence": 97,
        "+and": [
          {
            "+rule": [""],
            "where_to_search": {
              "search_in": ["email_from"],
              "limits": [[0,10], [-20]]
            }
          },
          {
            "-rule": ["D:var1"],
            "where_to_search": {
              "search_in": ["email_body"]
            }
          }
        ]
      }
    ]
  }

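In the block above, the pattern is defined once under variables and referenced from the rule as "D:var1". One plausible way to resolve such references is sketched below in Python; the function expand_variables and the interpretation of the "D:" prefix as "look up this variable" are assumptions based only on the example.

```python
def expand_variables(patterns, variables):
    """Replace every "D:<name>" entry by the patterns stored under <name>."""
    expanded = []
    for pattern in patterns:
        if pattern.startswith("D:"):
            # Assumption: the text after "D:" is the variable name.
            expanded.extend(variables[pattern[2:]])
        else:
            expanded.append(pattern)
    return expanded
```

Defining a pattern as a variable means it only has to be maintained in one place when several rules use it.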

Lemmatisation is the process of reducing the different forms of a word to one common base form. This is a very powerful way to improve the matching capabilities of regexes. The rule is a list of strings that have to be contained in the granularity. Here we don’t look at the original text, but at the lemmatised version of the text.
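To illustrate why matching on lemmatised text helps, here is a toy Python sketch. The lookup table LEMMAS and the function lemmatise are purely illustrative assumptions; a real system would use a proper lemmatiser rather than a hand-written dictionary.

```python
import re

# Toy lookup table standing in for a real lemmatiser (assumption).
LEMMAS = {
    "replied": "reply",
    "replies": "reply",
    "replying": "reply",
    "emails": "email",
}


def lemmatise(text):
    """Lower-case the text and map each known word form to its base form."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return " ".join(LEMMAS.get(token, token) for token in tokens)
```

A single pattern such as "reply" then covers "replied", "replies" and "replying" without listing every form in the regex.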


How can I only look in the email subject for my regex?

By adding the option where_to_search::search_in to your rule. An example field would look like this:

Code Block
"rules": [
  {
    "confidence": 97,
    "+rule": [""],
    "where_to_search": {"search_in": ["email_subject"]}
  }
]
