...

In the example below we describe the implementation for the field email_coming_from with the tag options no_reply and info, which match emails sent from noreply@contract.fit and info@contract.fit respectively.

Fields and tags need to be added as regex rules in the predictor settings. On the swagger page of your environment you can find the endpoint /predictor_settings/{scope}. The scope is the inbox/project for which you want this predictor to run. Inside key_value_pairs::rule_config you can specify per tag_field which regexes you want to match for a given field_name. Note that the field_name and tag options have to match exactly (case sensitive); otherwise the prediction will be empty.

Code Block
languagejson
{
  "key_value_pairs": {
    "rule_config": {
      "email_coming_from": {
        "no_reply": {
          "rules": [
            {
              "confidence": 97,
              "+rule": ["L:noreply@contract.fit"]
            }
          ]
        },
        "info": {
          "rules": [
            {
              "confidence": 97,
              "+rule": ["L:info@contract.fit"]                
            }
          ]
        }
      }
    }
  }
}

Per tag field (the level of no_reply and info) you can specify a list of rules, each with its own confidence. The matching rule with the highest confidence is presented in the prediction. If noreply@contract.fit appears somewhere in the uploaded document, a prediction with confidence 97 for the tag option no_reply is returned.
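The selection of the highest-confidence matching rule can be sketched in Python as follows. This is only an illustration of the behaviour described above, not the actual predictor code; the function name predict_tag and the use of Python's re module are assumptions.

```python
import re


def predict_tag(field_config, text):
    """Return (tag_option, confidence) of the best matching rule, or None.

    Hypothetical helper: it only understands "+rule" lists with "L:" parts.
    """
    best = None
    for tag_option, tag_config in field_config.items():
        for rule in tag_config["rules"]:
            # Drop the "L:" prefix of every part and join the parts into one regex.
            pattern = "".join(part[2:] for part in rule["+rule"])
            if re.search(pattern, text):
                if best is None or rule["confidence"] > best[1]:
                    best = (tag_option, rule["confidence"])
    return best


email_coming_from = {
    "no_reply": {"rules": [{"confidence": 97, "+rule": ["L:noreply@contract.fit"]}]},
    "info": {"rules": [{"confidence": 97, "+rule": ["L:info@contract.fit"]}]},
}

print(predict_tag(email_coming_from, "From: noreply@contract.fit"))  # ('no_reply', 97)
```

If no rule matches, the sketch returns None, which corresponds to an empty prediction.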

“+rule”

...


The simplest version of a rule is specified in +rule. The value for this key is a list of strings, each prefixed with L: or D:. This list will be concatenated into one regex.

L stands for literal and prefixes a normal regex. Note that the regex should be double escaped, so the regex for a digit becomes \\d instead of \d.

D stands for definition and prefixes a variable.
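To see why the double escaping is needed, the sketch below decodes a JSON rule and joins its L: parts into one regex. The helper name join_literal_parts is hypothetical, and Python's json and re modules stand in for whatever the platform uses internally.

```python
import json
import re


def join_literal_parts(parts):
    """Concatenate "L:" parts into a single regex string (hypothetical helper)."""
    return "".join(part[2:] for part in parts if part.startswith("L:"))


# The raw JSON text contains two backslashes before "+" and before "d" ...
raw_json = r'{"+rule": ["L:\\+32", "L:\\d{9}"]}'
rule = json.loads(raw_json)

# ... which JSON decoding turns into the single backslash the regex needs.
pattern = join_literal_parts(rule["+rule"])
print(pattern)
print(bool(re.search(pattern, "Call +32123456789")))
```

With a single backslash in the JSON text, \d is not a valid JSON escape sequence and the document would fail to parse.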

Specifying the search space

The search space is the area of the document the rules apply to, i.e. where they look for a matching result.

To be more precise about this area, we can use the where_to_search option, specified on the same level as the +rule.

It contains 4 settings, all of which are optional:

  1. preprocess_text: This is a boolean specifying if the text needs to be cleaned (= removal of noise such as unexpected characters, e.g. multiple dashes).

  2. search_in (for email): This field specifies which part of the email we look in. There are 5 options which can be combined. When left empty we will look in all these options:

    1. email_from

    2. email_to

    3. email_subject

    4. email_body

    5. attachment

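  3. limits: This field specifies which limits we apply to our search space (see the Limits section below).

  4. granularity: This field specifies what the granularity for a match should be (see the Granularity section below).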

Search_in

In the example given above (“is the sender noreply@contract.fit or info@contract.fit ?”) our rules applied to the entire document. This is too broad for this use-case. Instead we would like to limit our search to the email field which specifies the sender, the email_from field. We can do this by adding the following:

Code Block
languagejson
{
  "confidence": 97,
  "+rule": ["L:noreply@contract.fit"],
  "where_to_search": {"search_in": ["email_from"]}
}

Limits

You can apply different limits to the search space of your query.

These are the 5 options for limits:

  1. document_types: list of document types, these can be combined.

  2. pages: list of slices to specify which pages you want to search in.

  3. lines: list of slices to specify which lines you want to search in.

  4. characters: list of slices to specify which characters you want to search in.

  5. email_chains: dictionary of an integer as key (specifying what chain to look in) and a list of slices as value.

Options 2, 3, 4 and 5 use the notation of a Python slice. Each slice needs a start value and, except for the last slice in a list, a stop value. Start and stop can be integers, to define an index, or floats, to define a fraction of the total length (e.g. 0.2 for 20%). Start and stop need to have the same type.

Code Block
[start, stop]    # items from start through stop-1
[-start, -stop]  # items from start (counting from end) through stop-1 (counting from end)
[start]          # items from start through end (only allowed for the last slice)
[-start]         # items from start (counting from end) through end (only allowed for the last slice)

Example 1: Imagine that in our example we only want to look in the first 10 and last 10 characters of the email_from. In this case we would change our rule as follows:

Code Block
languagejson
{
  "confidence": 97,
  "+rule": ["L:noreply@contract.fit"],
  "where_to_search": {
    "search_in": ["email_from"],
    "limits": {
      "characters": [[0,10], [-10]]
    }
  }
}

If the email_from is the letters of the alphabet, “abcdefghijklmnopqrstuvwxyz”, we would now search in “abcdefghijqrstuvwxyz”.

Example 2: Imagine that in our example we only want to look in the first 20% of the characters of the email_from. In this case we would change our rule as follows:

Code Block
languagejson
{
  "confidence": 97,
  "+rule": ["L:noreply@contract.fit"],
  "where_to_search": {
    "search_in": ["email_from"],
    "limits": {
      "characters": [[0.0, 0.2]]
    }
  }
}

If the email_from is the letters of the alphabet, “abcdefghijklmnopqrstuvwxyz”, we would now search in “abcdef”.
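The two character-limit examples above can be reproduced with a short Python sketch. The function names are hypothetical, and the rounding of float bounds is an assumption: fractions are rounded up here because that gives the “abcdef” result of Example 2, but the platform's actual rounding is not documented on this page.

```python
import math


def to_index(value, length):
    """Turn a slice bound into an integer index.

    Floats are read as a fraction of the total length and rounded up
    (assumption, see the lead-in); integers are used as they are.
    """
    if isinstance(value, float):
        return int(math.copysign(math.ceil(abs(value) * length), value))
    return value


def apply_character_limits(text, slices):
    """Keep only the characters selected by a list of [start, stop] or [start] slices."""
    pieces = []
    for bounds in slices:
        start = to_index(bounds[0], len(text))
        if len(bounds) == 2:
            pieces.append(text[start:to_index(bounds[1], len(text))])
        else:
            # A single value means "from start through the end".
            pieces.append(text[start:])
    return "".join(pieces)


alphabet = "abcdefghijklmnopqrstuvwxyz"
print(apply_character_limits(alphabet, [[0, 10], [-10]]))  # abcdefghijqrstuvwxyz
print(apply_character_limits(alphabet, [[0.0, 0.2]]))      # abcdef
```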

Example 3: Imagine that in our example we have a chain of emails sent back and forth, with the first email being forwarded from noreply@contract.fit:

Attached file: Re_ Don't reply to this email.eml

If we don’t specify any limits on our email chains, the rule looks in all emails and finds one coming from noreply@contract.fit. However, we only care who sent the first email. So we specify a limit on the email chains that restricts the email search space to the first email in the first chain of emails.

Code Block
languagejson
{
  "confidence": 97,
  "+rule": ["L:noreply@contract.fit"],
  "where_to_search": {
    "search_in": ["email_from"],
    "limits": {
      "email_chains": {"0": [[0, 1]]}
    }
  }
}
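How an email_chains limit narrows down the emails that are searched can be sketched as below. The data model is an assumption made for illustration: chains are represented as a list of lists of email dictionaries, and the order of the emails inside a chain is whatever order the platform assigns. The function name select_emails is hypothetical.

```python
def select_emails(chains, email_chains_limit):
    """Return the emails selected by an email_chains limit.

    chains: list of chains, each chain a list of email dicts (assumed model).
    email_chains_limit: dict such as {"0": [[0, 1]]}, chain index -> slices.
    """
    selected = []
    for chain_key, slices in email_chains_limit.items():
        chain = chains[int(chain_key)]  # JSON object keys are strings
        for bounds in slices:
            start = bounds[0]
            stop = bounds[1] if len(bounds) == 2 else None
            selected.extend(chain[start:stop])
    return selected


chains = [
    [
        {"email_from": "noreply@contract.fit"},
        {"email_from": "alice@example.com"},
        {"email_from": "bob@example.com"},
    ]
]

print(select_emails(chains, {"0": [[0, 1]]}))
```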

Granularity

The granularity allows you to specify in which blocks of text we want to search.

The options are:

  1. full (default if nothing is specified)

  2. page

  3. sentence

  4. paragraph

  5. line
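The effect of the granularity can be illustrated with a small Python sketch that splits the text into blocks and requires all patterns to match inside one block. The splitting rules are assumptions for illustration only (form feed between pages, blank line between paragraphs, a naive sentence split on ., ! and ?); the platform may segment text differently.

```python
import re


def split_blocks(text, granularity="full"):
    """Split text into the blocks that are searched one at a time (assumed rules)."""
    if granularity == "full":
        return [text]
    if granularity == "page":
        return text.split("\f")
    if granularity == "paragraph":
        return [block for block in re.split(r"\n\s*\n", text) if block.strip()]
    if granularity == "line":
        return text.splitlines()
    if granularity == "sentence":
        return [s for s in re.split(r"(?<=[.!?])\s+", text) if s]
    raise ValueError(f"unknown granularity: {granularity}")


def all_match_in_one_block(patterns, text, granularity="full"):
    """True if a single block contains a match for every pattern."""
    return any(
        all(re.search(pattern, block) for pattern in patterns)
        for block in split_blocks(text, granularity)
    )


body = "Please reply soon. This is not urgent."
print(all_match_in_one_block(["not", "reply"], body, "full"))      # True
print(all_match_in_one_block(["not", "reply"], body, "sentence"))  # False
```

With full granularity both words are found somewhere in the text; with sentence granularity they never occur in the same sentence, so nothing matches.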

Logical additions to “+rule”

In some cases plain regexes, which already have and, or and not operators built in, no longer accommodate your rules. For those cases you can combine different regexes with higher-level and/or/not operators, which wrap lower-level operators such as +rule. To negate an operator, replace the + in front of it with a -.

This is the full list of operators with a dash between the higher level and low level operators:

  • +and: list of higher or lower level operators which should all match

  • -and: not version of and

  • +or: list of higher or lower level operators for which one should match

  • -or: not version of or

...

  • +rule: list of strings starting with L: or D: . When evaluating they will be appended into one regex.

  • -rule: not version of rule

  • +lemma: list of strings which are contained in the lemmatised text (see the Lemma section below).

  • -lemma: not version of lemma

Let’s say we want to match the emails coming from noreply@contract.fit if and only if in the body we don’t see a phone number. Our rule would now look like this:

Code Block
languagejson
{
  "confidence": 97,
  "+and": [
    {
      "+rule": ["L:noreply@contract.fit"],
      "where_to_search": {
        "search_in": ["email_from"],
        "limits": {"characters": [[0,10], [-20]]}
      }
    },
    {
      "-rule": ["L:\\+32\\d{9}"],
      "where_to_search": {
        "search_in": ["email_body"]
      }
    }
  ]
}
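The way such a nested rule is evaluated can be sketched with a small recursive function. This is an illustration under assumptions, not the platform's implementation: the email is modelled as a dictionary of parts, only search_in is honoured (limits and granularity are ignored), and the function name evaluate is hypothetical.

```python
import re


def evaluate(node, email):
    """Evaluate one operator node (+and, -and, +or, -or, +rule, -rule)."""
    for key, value in node.items():
        sign, name = key[0], key[1:]
        if sign not in "+-":
            continue  # "confidence", "where_to_search", ...
        if name == "rule":
            # Search only the requested email parts, or all parts if none are given.
            parts = node.get("where_to_search", {}).get("search_in") or list(email)
            text = "\n".join(email.get(part, "") for part in parts)
            pattern = "".join(piece[2:] for piece in value)  # strip "L:" and join
            result = re.search(pattern, text) is not None
        elif name == "and":
            result = all(evaluate(child, email) for child in value)
        elif name == "or":
            result = any(evaluate(child, email) for child in value)
        else:
            continue
        # A leading "-" negates the outcome of the operator.
        return result if sign == "+" else not result
    raise ValueError("no operator found in node")


rule = {
    "confidence": 97,
    "+and": [
        {"+rule": ["L:noreply@contract.fit"],
         "where_to_search": {"search_in": ["email_from"]}},
        {"-rule": ["L:\\+32\\d{9}"],
         "where_to_search": {"search_in": ["email_body"]}},
    ],
}

without_phone = {"email_from": "noreply@contract.fit", "email_body": "Hello"}
with_phone = {"email_from": "noreply@contract.fit", "email_body": "Call +32123456789"}
print(evaluate(rule, without_phone))  # True
print(evaluate(rule, with_phone))     # False
```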

Variables

When writing rules you might come across the case where you have added the same regex in many rules. With variables you can avoid this problem. Specify them on the highest level in a dictionary and use them with the prefix D: in your rules. Re-using the telephone number, our full dictionary would now look like this:

Code Block
languagejson
{
  "no_reply": {
    "variables": {
      "var1": ["L:\\+32\\d{9}"]
    },
    "rules": [
      {
        "confidence": 97,
        "+and": [
          {
            "+rule": ["L:noreply@contract.fit"],
            "where_to_search": {
              "search_in": ["email_from"],
              "limits": {"characters": [[0,10], [-20]]}
            }
          },
          {
            "-rule": ["D:var1"],
            "where_to_search": {
              "search_in": ["email_body"]
            }
          }
        ]
      }
    ]
  }
}
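Resolving D: references into the final regex can be sketched as below. The helper build_regex is hypothetical; letting a variable refer to another variable is an assumption this page does not confirm.

```python
def build_regex(parts, variables):
    """Expand a list of "L:" and "D:" parts into one regex string."""
    pieces = []
    for part in parts:
        prefix, body = part[:2], part[2:]
        if prefix == "L:":
            pieces.append(body)
        elif prefix == "D:":
            # Look up the variable and expand its own parts (nesting is an assumption).
            pieces.append(build_regex(variables[body], variables))
        else:
            raise ValueError(f"part must start with L: or D:, got {part!r}")
    return "".join(pieces)


variables = {"var1": ["L:\\+32\\d{9}"]}
print(build_regex(["D:var1"], variables))
print(build_regex(["L:tel: ", "D:var1"], variables))
```

An unknown variable name raises a KeyError in this sketch, which makes typos in D: references visible early.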

Lemma

Lemmatisation is the process of reducing the inflected forms of a word to one base form, the lemma (for example, answering becomes answer). More information can be found here: https://en.wikipedia.org/wiki/Lemmatisation. This is a very powerful way to improve the matching capabilities of regexes. After "+lemma" you can add a list of lemmatised words; they are matched against the lemmatised version of the text, not the original text. If one of the words matches, your rule will match. Because lemmas match more broadly, it is advised to reduce the search space by requiring the strings to occur within the same granularity.

Let’s say we want to find out whether our email is coming from noreply by looking at what is written in the email body. For our rule we specify a granularity of sentence, meaning that all matches should be in the same sentence. Our rule would look like this:

Code Block
{
  "no_reply": {
    "rules": [
      {
        "+and": [
          {"+lemma": ["not"]},
          {"+lemma": ["reply", "answer"]}
        ],
        "confidence": 80,
        "where_to_search": {"search_in": ["email_body"], "granularity": "sentence"}
      }
    ]
  }
}

If someone now writes “I don’t expect you to reply.” or “I do not want you answering this email” in the email body, the rule will match.
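The idea of sentence-level lemma matching can be illustrated with a toy Python sketch. Everything here is an assumption made for illustration: the tiny TOY_LEMMAS lookup table stands in for a real lemmatiser, the sentence split is naive, and the function names are hypothetical.

```python
import re

# Toy lookup table standing in for a real lemmatiser (assumption).
TOY_LEMMAS = {
    "don't": ["do", "not"],
    "don’t": ["do", "not"],
    "answering": ["answer"],
    "replied": ["reply"],
}


def lemmatise(sentence):
    """Lower-case and tokenise a sentence, then map each token to its lemma(s)."""
    lemmas = []
    for token in re.findall(r"[a-z’']+", sentence.lower()):
        lemmas.extend(TOY_LEMMAS.get(token, [token]))
    return lemmas


def lemma_and_matches(lemma_nodes, text):
    """True if one sentence satisfies every +lemma node (any word of a node suffices)."""
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        lemmas = set(lemmatise(sentence))
        if all(any(word in lemmas for word in node["+lemma"]) for node in lemma_nodes):
            return True
    return False


nodes = [{"+lemma": ["not"]}, {"+lemma": ["reply", "answer"]}]
print(lemma_and_matches(nodes, "I don’t expect you to reply."))
print(lemma_and_matches(nodes, "I do not want you answering this email"))
print(lemma_and_matches(nodes, "Please reply. This is not spam."))
```

The first two texts match; the third does not, because "not" and "reply" appear in different sentences.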

Note

IMPORTANT: Regex is quite a bit more efficient than the and/or operators. Try to use regexes as much as possible.

Info

Note that when using different operators the where_to_search will be passed down. If on a lower level one is found, that one will be used.

This way you can:

  • Specify a granularity that applies to different and/or rules

  • Limit the search space for different and/or rules without having to define the where_to_search multiple times
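How where_to_search is passed down can be sketched as a tree walk that reports the search space each leaf ends up with. This is an illustration, not the platform's code: the function name is hypothetical, and a lower-level where_to_search is assumed to replace the inherited one completely rather than being merged with it.

```python
LEAF_OPERATORS = ("rule", "lemma")
GROUP_OPERATORS = ("and", "or")


def leaf_search_spaces(node, inherited=None):
    """Return (operator, value, where_to_search) for every leaf below node."""
    # A where_to_search on this level overrides the inherited one (assumed: no merging).
    where = node.get("where_to_search", inherited)
    found = []
    for key, value in node.items():
        sign, name = key[0], key[1:]
        if sign not in "+-":
            continue  # "confidence", "where_to_search", ...
        if name in LEAF_OPERATORS:
            found.append((key, value, where))
        elif name in GROUP_OPERATORS:
            for child in value:
                found.extend(leaf_search_spaces(child, where))
    return found


rule = {
    "confidence": 80,
    "where_to_search": {"search_in": ["email_body"], "granularity": "sentence"},
    "+and": [
        {"+lemma": ["not"]},
        {"+rule": ["L:reply"], "where_to_search": {"search_in": ["email_subject"]}},
    ],
}

for operator, value, where in leaf_search_spaces(rule):
    print(operator, value, where)
```

The +lemma leaf inherits the top-level setting, while the +rule leaf uses its own.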

Tag example

Code Block

Extraction example

Code Block

Document type example

Code Block
{
  "type": "document_type",
  "rules": [
    {
      "gen_id": "Bat&BallHotel",
      "confidence": 100,
      "rule_type": "first",
      "+and": [
        {"+rule": ["L:(?i)The Bat & Ball Hotel"]},
        {"+rule": ["L:(?i)Order"]}
      ]
    }
  ]
}

FAQ

How can I only look in the email subject for my regex?

By adding the option where_to_search::search_in to your rule. An example field would look like this:

Code Block
languagejson
"rules": [
  {
      "confidence": 97,
      "+rule": ["L:noreply@contract.fit"],
      "where_to_search":
        {
          "search_in": ["email_subject"]
        }
  }
]

...