...

For each tag field (at the level of no_reply and info) you can specify a list of rules, each with its own confidence. Of the rules that match, the one with the highest confidence is presented in the prediction. For example, if no-reply@contract.fit appears anywhere in the uploaded document, a prediction with confidence 97 for the tag_option no_reply is returned.
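The selection of the highest-confidence match can be sketched as follows. This is a minimal illustration, not the actual implementation: the second rule and the helper names to_regex and best_confidence are hypothetical, and the sketch assumes that an L: part is used as a regex fragment as-is.

```python
import re

# Hypothetical rule list for the tag field no_reply, in the shape used on this page.
# The second rule (confidence 60) is made up for the illustration.
RULES = [
    {"confidence": 97, "+rule": ["L:no-reply@contract.fit"]},
    {"confidence": 60, "+rule": ["L:do not reply"]},
]


def to_regex(parts):
    # Assumption: "L:" marks a fragment used as-is; the fragments are concatenated.
    return "".join(part[2:] for part in parts)


def best_confidence(rules, text):
    """Return the highest confidence among the rules that match, or None."""
    matched = [
        rule["confidence"]
        for rule in rules
        if re.search(to_regex(rule["+rule"]), text)
    ]
    return max(matched) if matched else None
```

When both rules match the same document, only the confidence of the stronger rule (97) is reported; when no rule matches, the sketch returns None.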

“+rule”

The simplest version of a rule is specified in +rule. The value of this key is a list of strings, each prefixed with L: or D:. The list is concatenated into one regex.

...

D stands for definition and prefixes a variable.
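To make the concatenation concrete, here is a rough sketch of how a list of prefixed strings could be turned into one regex. The function build_regex is hypothetical, and the sketch assumes that an L: part is used as a regex fragment as-is and that a D: part is replaced by the expanded content of the variable with that name, without any extra grouping.

```python
import re


def build_regex(parts, variables=None):
    """Concatenate prefixed parts into one regex string (illustrative only)."""
    variables = variables or {}
    pieces = []
    for part in parts:
        # Split on the first colon only, so the fragment itself may contain colons.
        prefix, _, value = part.partition(":")
        if prefix == "L":
            # Assumption: the fragment is used as-is.
            pieces.append(value)
        elif prefix == "D":
            # Assumption: a variable is itself a list of prefixed parts.
            pieces.append(build_regex(variables[value], variables))
        else:
            raise ValueError(f"unknown prefix in {part!r}")
    return "".join(pieces)


variables = {"var1": ["L:\\+32\\d{9}"]}
pattern = build_regex(["L:tel: ", "D:var1"], variables)
found = re.search(pattern, "tel: +32123456789") is not None
```

Here pattern is the single regex built from both parts, and found is True because the sample text contains +32 followed by nine digits.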

Specifying the search space

The search space is the area of the document the rules apply to, i.e. where they look for a matching result.

...

  1. preprocess_text: A boolean specifying whether the text needs to be cleaned, i.e. whether noise such as unexpected characters (e.g. multiple dashes) is removed.

  2. search_in (for email): This field specifies which parts of the email we look in. There are 5 options, which can be combined. When left empty, we look in all of them:

    1. email_from

    2. email_to

    3. email_subject

    4. email_body

    5. attachment

  3. limits: This field specifies which limits we apply to our search space (see Limits below).

  4. granularity: This field specifies in which blocks of text we search (see Granularity below).

Search_in

In the example given above (“is the sender no-reply@contract.fit or info@contract.fit?”) our rules applied to the entire document, which is too broad for this use case. Instead we want to limit the search to the email field that specifies the sender, the email_from field. We can do this by adding the following:

Code Block
languagejson
{
  "confidence": 97,
  "+rule": ["L:no-reply@contract.fit"],
  "where_to_search": {"search_in": ["email_from"]}
}

Limits

You can apply different limits to the search space of your query.

...

If the email_from is the letters of the alphabet, “abcdefghijklmnopqrstuvwxyz”, we would now search in “abcdef”.

Granularity

The granularity allows you to specify in which blocks of text we want to search.

...

  1. full (default if nothing is specified)

  2. page

  3. sentence

  4. paragraph

  5. line
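The effect of the granularity can be illustrated with a small sketch: the text is cut into blocks and the regex has to match inside a single block. The splitting rules below are naive assumptions made for the illustration (form feed between pages, blank line between paragraphs, sentence end at ., ! or ?), as is the choice to let . match newlines; the helpers split_blocks and matches are hypothetical and do not reflect how the product actually segments text.

```python
import re


def split_blocks(text, granularity="full"):
    """Split text into the blocks a rule is matched against (illustrative only)."""
    if granularity == "full":
        return [text]
    if granularity == "page":
        # Assumption: pages are separated by a form feed character.
        blocks = text.split("\f")
    elif granularity == "paragraph":
        # Assumption: paragraphs are separated by one or more blank lines.
        blocks = re.split(r"\n\s*\n", text)
    elif granularity == "line":
        blocks = text.splitlines()
    elif granularity == "sentence":
        # Naive assumption: a sentence ends at ., ! or ? followed by whitespace.
        blocks = re.split(r"(?<=[.!?])\s+", text)
    else:
        raise ValueError(f"unknown granularity: {granularity!r}")
    return [block.strip() for block in blocks if block.strip()]


def matches(pattern, text, granularity="full"):
    """True if the regex matches inside at least one block."""
    # Assumption: "." also matches newlines, so a match may span lines within a block.
    return any(
        re.search(pattern, block, flags=re.DOTALL)
        for block in split_blocks(text, granularity)
    )
```

With the text "Please call us.\nDo not reply to this email." the pattern call.*reply matches at granularity full, but not at line or sentence, because the two words end up in different blocks.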

Logical additions to “+rule”

In some cases plain regexes, which already have and, or and not operators built in, can no longer accommodate your rules. For those cases you can combine different regexes with higher-level and, or and not operators, and extend them with low-level operators such as +rule. To specify a not operator, replace the + in front of your operator with a -.

...

Code Block
languagejson
{
  "confidence": 97,
  "+and": [
    {
      "+rule": ["L:no-reply@contract.fit"],
      "where_to_search": {
        "search_in": ["email_from"],
        "limits": [[0,10], [-20]]}
    },
    {
      "-rule": ["L:\\+32\\d{9}"],
      "where_to_search": {
        "search_in": ["email_body"]
      }
    }
  ]
}
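How such a combined rule could be evaluated is sketched below. This is an illustration under stated assumptions, not the actual engine: the email is modelled as a hypothetical dict keyed by the search_in options, every part of a +rule is assumed to be an L: fragment, limits are ignored, and the names search_text and evaluate are made up.

```python
import re

ALL_PARTS = ["email_from", "email_to", "email_subject", "email_body", "attachment"]


def search_text(email, where):
    # An empty or missing search_in means: look in all parts.
    parts = (where or {}).get("search_in") or ALL_PARTS
    return "\n".join(email.get(part, "") for part in parts)


def evaluate(node, email):
    """Evaluate one operator node against a hypothetical email dict."""
    # Find the operator key, e.g. "+rule", "-rule", "+and" or "+or".
    key = next(k for k in node if k[0] in "+-" and k[1:] in ("rule", "and", "or"))
    negate, name = key[0] == "-", key[1:]
    if name == "rule":
        # Assumption: every part is an "L:" fragment; limits are ignored here.
        pattern = "".join(part[2:] for part in node[key])
        text = search_text(email, node.get("where_to_search"))
        result = re.search(pattern, text) is not None
    elif name == "and":
        result = all(evaluate(child, email) for child in node[key])
    else:
        result = any(evaluate(child, email) for child in node[key])
    # A leading "-" inverts the outcome of the operator.
    return result != negate


rule = {
    "confidence": 97,
    "+and": [
        {"+rule": ["L:no-reply@contract.fit"],
         "where_to_search": {"search_in": ["email_from"]}},
        {"-rule": ["L:\\+32\\d{9}"],
         "where_to_search": {"search_in": ["email_body"]}},
    ],
}
plain = {"email_from": "no-reply@contract.fit", "email_body": "Thanks for your message."}
with_phone = {"email_from": "no-reply@contract.fit", "email_body": "Call +32123456789"}
```

In this sketch evaluate(rule, plain) is True, while evaluate(rule, with_phone) is False because the -rule rejects a body that contains the phone number.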

Variables

When writing rules you might find that you have added the same regex to many rules. Variables let you avoid this duplication. Specify them at the highest level in a dictionary and use them with the prefix D: in your rules. Reusing the telephone number regex, our full dictionary would now look like this:

Code Block
languagejson
{
  "no_reply": {
    "variables": {
      "var1": ["L:\\+32\\d{9}"]
    },
    "rules": [
      {
        "confidence": 97,
        "+and": [
          {
            "+rule": ["L:no-reply@contract.fit"],
            "where_to_search": {
              "search_in": ["email_from"],
              "limits": [[0,10], [-20]]}
          },
          {
            "-rule": ["D:var1"],
            "where_to_search": {
              "search_in": ["email_body"]
            }
          }
        ]
      }
    ]
  }
}

Lemma

This is a list of strings to look for within each block of the chosen granularity. Here we don’t search the original text, but the lemmatised version of the text.

Note

IMPORTANT: Regex is quite a bit more efficient than the and/or operators. Try to use regexes as much as possible.

Info

Note that when operators are nested, the where_to_search is passed down to the lower levels. If a lower level defines its own where_to_search, that one is used instead.

This way you can:

  • Specify a granularity that applies to different and/or rules

  • Limit the search space for different and/or rules without having to define the where_to_search multiple times
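The passing down of where_to_search can be sketched as a walk over the rule tree. The function effective_where is hypothetical, and the sketch assumes that a where_to_search found on a lower level replaces the inherited one as a whole rather than being merged with it key by key.

```python
def effective_where(node, inherited=None):
    """Yield (rule parts, where_to_search) for every +rule/-rule leaf."""
    # A where_to_search on this level wins over the one passed down from above.
    # Assumption: it replaces the inherited one entirely (no per-key merge).
    where = node.get("where_to_search", inherited)
    for key, value in node.items():
        if key[0] in "+-" and key[1:] in ("and", "or"):
            for child in value:
                yield from effective_where(child, where)
        elif key[0] in "+-" and key[1:] == "rule":
            yield value, where


rule = {
    "confidence": 97,
    "where_to_search": {"search_in": ["email_body"]},
    "+and": [
        {"+rule": ["L:invoice"]},
        {"-rule": ["L:reminder"],
         "where_to_search": {"search_in": ["email_subject"]}},
    ],
}
resolved = list(effective_where(rule))
```

In resolved, the first leaf inherits the email_body search space from the top level, while the second leaf keeps its own email_subject setting.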

FAQ

How can I only look in the email subject for my regex?

By adding the option where_to_search::search_in to your rule. An example field would look like this:

Code Block
languagejson
"rules": [
  {
    "confidence": 97,
    "+rule": ["L:no-reply@contract.fit"],
    "where_to_search": {
      "search_in": ["email_subject"]
    }
  }
]

...