...

In the example below we describe the implementation for the field email_coming_from with the tag options no_reply and info, which match emails sent from noreply@contract.fit and info@contract.fit respectively.

...

Code Block
languagejson
{
  "key_value_pairs": {
    "rule_config": {
      "email_coming_from": {
        "no_reply": {
          "rules": [
            {
              "confidence": 97,
              "+rule": ["L:noreply@contract.fit"]
            }
          ]
        },
        "info": {
          "rules": [
            {
              "confidence": 97,
              "+rule": ["L:info@contract.fit"]                
            }
          ]
        }
      }
    }
  }
}

Per tag option (the level of no_reply and info) you can specify a list of rules, each with its own confidence. The matching rule with the highest confidence is presented in the prediction. If noreply@contract.fit appears anywhere in the uploaded document, a prediction with confidence 97 for the tag option no_reply is returned.
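The selection of the highest-confidence rule can be sketched in Python. This is only an illustration, not the actual implementation: the function name predict, the handling of the "L:" prefix (dropped, with the remainder used as a regex) and the second, weaker rule are assumptions made for the example.

```python
import re

def predict(rule_config, text):
    """Return {tag_option: confidence}, using the highest-confidence matching rule per tag option."""
    predictions = {}
    for tag_option, spec in rule_config.items():
        matched = []
        for rule in spec["rules"]:
            # Assumption: the prefix before ":" is dropped and the rest is a regex;
            # a rule matches as soon as one of its patterns is found.
            regexes = [p.split(":", 1)[1] for p in rule["+rule"]]
            if any(re.search(rx, text) for rx in regexes):
                matched.append(rule["confidence"])
        if matched:
            predictions[tag_option] = max(matched)
    return predictions

rule_config = {
    "no_reply": {"rules": [
        {"confidence": 97, "+rule": ["L:noreply@contract.fit"]},
        {"confidence": 60, "+rule": ["L:noreply"]},  # hypothetical weaker rule
    ]},
    "info": {"rules": [{"confidence": 97, "+rule": ["L:info@contract.fit"]}]},
}

print(predict(rule_config, "From: noreply@contract.fit"))
```

Both no_reply rules match this text, so only the confidence of the stronger one is reported; info does not match and is left out.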

...

This function has 4 options, all of them optional:

  1. preprocess_text: A boolean specifying whether the text needs to be cleaned (removal of noise such as unexpected characters, e.g. multiple dashes).

  2. search_in (for email): Specifies which parts of the email we look in. There are 5 options, which can be combined. When left empty, we look in all of them:

    1. email_from

    2. email_to

    3. email_subject

    4. email_body

    5. attachment

  3. limits: This field specifies which limits we apply to our search_space (more info here).

  4. Granularity

Search_in

In the example given above (“is the sender noreply@contract.fit or info@contract.fit?”) our rules applied to the entire document. This is too broad for this use case. Instead, we would like to limit our search to the email field that specifies the sender, the email_from field. We can do this by adding the following:

Code Block
languagejson
{
  "confidence": 97,
  "+rule": ["L:noreply@contract.fit"],
  "where_to_search": {"search_in": ["email_from"]}
}
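As a rough Python sketch of how search_in narrows the search space: the email dictionary, its contents and the function name build_search_space are hypothetical, and the real service may represent emails differently.

```python
EMAIL_PARTS = ["email_from", "email_to", "email_subject", "email_body", "attachment"]

def build_search_space(email, search_in=None):
    """Return the text of the selected email parts; an empty selection means all parts."""
    selected = search_in or EMAIL_PARTS
    unknown = set(selected) - set(EMAIL_PARTS)
    if unknown:
        raise ValueError(f"unknown search_in values: {sorted(unknown)}")
    return {part: email.get(part, "") for part in selected}

# Hypothetical email, for illustration only.
email = {
    "email_from": "noreply@contract.fit",
    "email_to": "someone@example.com",
    "email_subject": "Your contract",
    "email_body": "Questions? Contact info@contract.fit",
    "attachment": "",
}

print(build_search_space(email, ["email_from"]))
```

With search_in set to email_from, the info@contract.fit address in the body is no longer part of the search space, so the info rule cannot match on it.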

...

These are the 5 options for limits:

  1. document_types: list of document types, these can be combined.

  2. pages: list of slices to specify which pages you want to search in.

  3. lines: list of slices to specify which lines you want to search in.

  4. characters: list of slices to specify which characters you want to search in.

  5. email_chains: dictionary of an integer as key (specifying what chain to look in) and a list of slices as value.

For options 2, 3, 4 and 5 we use the Python slice convention. Each slice requires a start and a stop value. Start and stop can be integers, to define an index, or floats, to define a fraction of the total length. Start and stop must have the same type.

...

Code Block
languagejson
{
  "confidence": 97,
  "+rule": ["L:noreply@contract.fit"],
  "where_to_search": {
    "search_in": ["email_from"],
    "limits": {
      "characters": [[0,10], [-10]]
    }
  }
}

...

Code Block
languagejson
{
  "confidence": 97,
  "+rule": ["L:noreply@contract.fit"],
  "where_to_search": {
    "search_in": ["email_from"],
    "limits": {
      "characters": [[0., 0.2]]
    }
  }
}

If email_from contains the letters of the alphabet, “abcdefghijklmnopqrstuvwxyz”, we would now search in “abcdef”.
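The slice behaviour can be approximated with a short Python sketch. The function name apply_slices is hypothetical, and the rounding is an assumption: fractions are multiplied by the text length and rounded up, which was chosen because it reproduces the six characters of the alphabet example above. Only two-value [start, stop] slices are handled here.

```python
import math

def apply_slices(text, slices):
    """Return the pieces of `text` selected by a list of [start, stop] slices."""
    pieces = []
    for start, stop in slices:
        if type(start) is not type(stop):
            raise TypeError("start and stop must have the same type")
        if isinstance(start, float):
            # Assumption: a float is a fraction of the length, rounded up to an index.
            start = math.ceil(start * len(text))
            stop = math.ceil(stop * len(text))
        pieces.append(text[start:stop])
    return pieces

alphabet = "abcdefghijklmnopqrstuvwxyz"
print(apply_slices(alphabet, [[0, 10]]))     # integer indices
print(apply_slices(alphabet, [[0.0, 0.2]]))  # fraction of the length
```

Mixing an integer start with a float stop raises an error in this sketch, mirroring the requirement that both values have the same type.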

Example 3: Imagine that in our example we have a chain of emails sent back and forth, with the first email being forwarded from noreply@contract.fit:

Attached file: Re_ Don't reply to this email.eml

If we don’t specify any limits on our email chains, the rule looks in all emails and finds one coming from noreply@contract.fit. However, we only care about who sent the first email. So we specify a limit on the email chains that restricts the search space to the first email in the first chain of emails.

Code Block
languagejson
{
  "confidence": 97,
  "+rule": ["L:noreply@contract.fit"],
  "where_to_search": {
    "search_in": ["email_from"],
    "limits": {
      "email_chains": {"0": [[0, 1]]}
    }
  }
}
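A minimal Python sketch of how such an email_chains limit could be applied. The data model is an assumption made for illustration: chains is a list of chains, each chain is a list of senders, and index 0 is taken to be the first email of a chain. The function name and the example.com addresses are hypothetical.

```python
def limit_email_chains(chains, email_chains):
    """Select emails per chain, given {chain_index: [[start, stop], ...]}."""
    selected = []
    for chain_key, slices in email_chains.items():
        chain = chains[int(chain_key)]  # JSON object keys are strings, so convert
        for start, stop in slices:
            selected.extend(chain[start:stop])
    return selected

# Hypothetical chains, represented by the sender of each email.
chains = [
    ["noreply@contract.fit", "alice@example.com", "bob@example.com"],
    ["carol@example.com"],
]

print(limit_email_chains(chains, {"0": [[0, 1]]}))
```

Only the first email of chain 0 remains in the search space; the replies in that chain and the whole second chain are ignored.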

Granularity

The granularity allows you to specify in which blocks of text we want to search.

The options are:

  1. full (default if nothing is specified)

  2. page

  3. sentence

  4. paragraph

  5. line
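To make the effect of the granularity options concrete, here is a hedged Python sketch that splits a text into blocks. The splitting heuristics are assumptions, not the service's actual behaviour: pages are separated by form feeds, paragraphs by blank lines, and sentences by whitespace after ".", "!" or "?". The function name is hypothetical.

```python
import re

def split_by_granularity(text, granularity="full"):
    """Split `text` into the blocks in which a rule would be searched."""
    if granularity == "full":
        return [text]
    if granularity == "page":
        blocks = text.split("\f")                   # assumption: form feed ends a page
    elif granularity == "paragraph":
        blocks = re.split(r"\n\s*\n", text)         # assumption: blank line ends a paragraph
    elif granularity == "line":
        blocks = text.splitlines()
    elif granularity == "sentence":
        blocks = re.split(r"(?<=[.!?])\s+", text)   # naive sentence boundary
    else:
        raise ValueError(f"unknown granularity: {granularity}")
    return [block.strip() for block in blocks if block.strip()]

text = "Hello there. Call me.\nSecond line\n\nNew paragraph\fPage two"
for option in ["full", "page", "sentence", "paragraph", "line"]:
    print(option, len(split_by_granularity(text, option)))
```

A finer granularity produces more, smaller blocks, so patterns that must occur together are only matched when they fall in the same block.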

Logical additions to “+rule”

...

Let’s say we want to match the emails coming from noreply@contract.fit only if the body does not contain a phone number. Our rule would now look like this:

Code Block
languagejson
{
  "confidence": 97,
  "+and": [
    {
      "+rule": ["L:noreply@contract.fit"],
      "where_to_search": {
        "search_in": ["email_from"],
        "limits": [[0,10], [-20]]}
    },
    {
      "-rule": ["L:\\+32\\d{9}"],
      "where_to_search": {
        "search_in": ["email_body"]
      }
    }
  ]
}
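The logic of combining "+rule" and "-rule" under "+and" can be sketched in Python as follows. This is an illustration under assumptions: the prefix before ":" is dropped and the rest is used as a regex, a "-rule" succeeds when none of its patterns is found, and limits are ignored for brevity. The function names and sample emails are hypothetical.

```python
import re

def pattern_found(patterns, text):
    # Assumption: drop the prefix before ":"; one matching pattern is enough.
    return any(re.search(p.split(":", 1)[1], text) for p in patterns)

def evaluate(node, email):
    """Evaluate a rule node against an email given as {part_name: text}."""
    if "+and" in node:
        return all(evaluate(child, email) for child in node["+and"])
    parts = node.get("where_to_search", {}).get("search_in") or list(email)
    text = "\n".join(email.get(part, "") for part in parts)
    if "+rule" in node:
        return pattern_found(node["+rule"], text)
    if "-rule" in node:
        return not pattern_found(node["-rule"], text)
    raise ValueError("unsupported rule node")

rule = {
    "confidence": 97,
    "+and": [
        {"+rule": ["L:noreply@contract.fit"],
         "where_to_search": {"search_in": ["email_from"]}},
        {"-rule": [r"L:\+32\d{9}"],
         "where_to_search": {"search_in": ["email_body"]}},
    ],
}

without_phone = {"email_from": "noreply@contract.fit", "email_body": "Hello"}
with_phone = {"email_from": "noreply@contract.fit", "email_body": "Call +32123456789"}
print(evaluate(rule, without_phone), evaluate(rule, with_phone))
```

The second email has the right sender but contains a phone number in the body, so the "-rule" fails and the whole "+and" does not match.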

...

Code Block
languagejson
{
  "no_reply": {
    "variables": {
      "var1": ["L:\\+32\\d{9}"]
    },
    "rules": [
      {
        "confidence": 97,
        "+and": [
          {
            "+rule": ["L:noreply@contract.fit"],
            "where_to_search": {
              "search_in": ["email_from"],
              "limits": [[0,10], [-20]]}
          },
          {
            "-rule": ["D:var1"],
            "where_to_search": {
              "search_in": ["email_body"]
            }
          }
        ]
      }
    ]
  }
}
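How a "D:" reference might be resolved against the variables block can be sketched in Python. The expansion strategy is an assumption (a "D:<name>" entry is replaced by the patterns stored under that name), and the function name is hypothetical.

```python
def expand_variables(patterns, variables):
    """Replace "D:<name>" entries by the patterns defined under `variables`."""
    expanded = []
    for pattern in patterns:
        prefix, _, value = pattern.partition(":")
        if prefix == "D":
            # Assumption: the variable holds a list of patterns that is inlined.
            expanded.extend(variables[value])
        else:
            expanded.append(pattern)
    return expanded

variables = {"var1": [r"L:\+32\d{9}"]}
print(expand_variables(["D:var1"], variables))
```

Defining the phone number pattern once as var1 lets several rules reuse it without repeating the regex.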

Lemma

Lemmatisation is the process of reducing the different forms of a word to a single base form. More information can be found here: https://en.wikipedia.org/wiki/Lemmatisation. This is a very powerful way to improve the matching capabilities of regexes. The lemma option is a list of strings that must be contained in the granularity. Here we don’t look at the original text, but at the lemmatised version of the text.

...

How can I only look in the email subject for my regex?

By adding the option where_to_search::search_in to your rule. An example field would look like this:

Code Block
languagejson
"rules": [
  {
    "confidence": 97,
    "+rule": ["L:noreply@contract.fit"],
    "where_to_search": {
      "search_in": ["email_subject"]
    }
  }
]

...