
Ever wanted to add a field that does not exist in our out-of-the-box models and thought, “We don’t need a smart model for this, I just need to look for certain words”? For this use case we provide a predictor that you control through the predictor settings.

First, add a tag field to your format. In this example we chose the name email_coming_from with the tag options no_reply and info. Then add your rules as regexes in the predictor settings. On the swagger page of your environment you can find the endpoint /predictor_settings/{scope}, where the scope is the inbox or project for which this predictor should run. Inside key_value_pairs::rule_config you specify, per tag field and per tag option, which regexes should match. Note that the field name and the tag options have to match those in your format exactly (case sensitive); otherwise the prediction will be empty. We created these rules to match no-reply@contract.fit and info@contract.fit respectively:

{
  "key_value_pairs": {
    "rule_config": {
      "email_coming_from": {
        "no_reply": {
          "rules": [
            {
              "confidence": 97,
              "+rule": ["L:no-reply@contract.fit"]
            }
          ]
        },
        "info": {
          "rules": [
            {
              "confidence": 97,
              "+rule": ["L:info@contract.fit"]
            }
          ]
        }
      }
    }
  }
}

Per tag option (the level of no_reply and info) you can specify a list of rules, each with its own confidence. The matching rule with the highest confidence is presented in the prediction. If we see no-reply@contract.fit somewhere in the uploaded document, a prediction with confidence 97 for the tag option no_reply is returned.
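
The selection logic described above can be sketched in a few lines of Python. This is only an illustration under assumptions, not the predictor's actual implementation: the helper names to_regex and predict are hypothetical, and the sketch assumes that the L: strings of a +rule are simply joined into one regex that is searched anywhere in the text.

```python
import re

# Same shape as the rule_config of one tag field in the example above.
RULE_CONFIG = {
    "no_reply": {"rules": [{"confidence": 97, "+rule": ["L:no-reply@contract.fit"]}]},
    "info": {"rules": [{"confidence": 97, "+rule": ["L:info@contract.fit"]}]},
}


def to_regex(parts):
    """Join the "L:" strings of a +rule into one regex (assumed behaviour)."""
    return "".join(part[2:] for part in parts if part.startswith("L:"))


def predict(text, tag_config):
    """Return (tag_option, confidence) of the best matching rule, or None."""
    best = None
    for tag_option, option_config in tag_config.items():
        for rule in option_config["rules"]:
            if re.search(to_regex(rule["+rule"]), text):
                if best is None or rule["confidence"] > best[1]:
                    best = (tag_option, rule["confidence"])
    return best


print(predict("From: no-reply@contract.fit", RULE_CONFIG))  # ('no_reply', 97)
print(predict("From: someone@example.com", RULE_CONFIG))    # None
```

When no rule matches, the sketch returns None, which corresponds to the empty prediction mentioned earlier.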

Specifying the search space

The rule we have specified so far looks in the whole document, which is way too broad. We only want to search inside the email field that holds the sender. For this we can use the where_to_search option, specified on the same level as the +rule. It has 4 settings that determine where and how to search, all of them optional:

  1. preprocess_text: This is a bool specifying if the text needs to be cleaned.

  2. search_in: This field specifies which part of the text we look in. There are 5 options which can be combined. When left empty we will look in all these options:

    1. email_from

    2. email_to

    3. email_subject

    4. email_body

    5. attachment

  3. limits: This field specifies which limits we apply to our search space (see Limits below).

  4. granularity: This field specifies what the granularity for a match should be (see Granularity below).

In the case of our example the rule would now look like this:

{
  "confidence": 97,
  "+rule": ["L:no-reply@contract.fit"],
  "where_to_search": {"search_in": ["email_from"]}
}
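
To make the effect of search_in concrete, here is a minimal Python sketch of how the search space could be narrowed down. The function select_search_space and the shape of the document dictionary are assumptions made for illustration; only the five part names come from the list above.

```python
ALL_PARTS = ["email_from", "email_to", "email_subject", "email_body", "attachment"]


def select_search_space(document, where_to_search=None):
    """Return the text parts a rule would look at, keyed by part name."""
    # An empty or missing search_in means: look in all parts.
    search_in = (where_to_search or {}).get("search_in") or ALL_PARTS
    return {part: document[part] for part in search_in if part in document}


# Hypothetical document, already split into its parts.
document = {
    "email_from": "no-reply@contract.fit",
    "email_to": "support@example.com",
    "email_subject": "Your contract",
    "email_body": "Please find the contract attached.",
    "attachment": "Contract text ...",
}

print(select_search_space(document, {"search_in": ["email_from"]}))
# {'email_from': 'no-reply@contract.fit'}
print(list(select_search_space(document)))
# ['email_from', 'email_to', 'email_subject', 'email_body', 'attachment']
```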

Limits

You can apply different limits to the place you are searching in. For this we use the notion of a python slice. To select the part of the full object you want, you specify a list of slices. The syntax of a slice is as follows:

[start, stop]     # items from start through stop-1
[-start, -stop]   # items from start (counting from the end) through stop-1 (counting from the end)
[start]           # items from start through the end (only allowed for the last slice)
[-start]          # items from start (counting from the end) through the end (only allowed for the last slice)

These are the 4 options in limits:

  1. document_types: list of document types which can be combined.

  2. pages: list of slices to specify which pages you want to search in.

  3. lines: list of slices to specify which lines you want to search in.

  4. characters: list of slices to specify which characters you want to search in.

If we go back to our example and only want to look in the first 10 and last 20 characters of the email_from, our rule would now look like this:

{
  "confidence": 97,
  "+rule": ["L:no-reply@contract.fit"],
  "where_to_search": {
    "search_in": ["email_from"],
    "limits": [[0,10], [-20]]}
}
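
The slice notation maps directly onto Python slicing. The sketch below shows how a list of slices such as [[0,10], [-20]] could be applied to a string of characters or to a list of lines or pages. The helper apply_slices is hypothetical and only illustrates the notation; it is not the predictor's code.

```python
import string


def apply_slices(items, slices):
    """Return the pieces of `items` selected by [start, stop] or [start] slices."""
    pieces = []
    for index, bounds in enumerate(slices):
        if len(bounds) == 2:
            start, stop = bounds
            pieces.append(items[start:stop])
        elif len(bounds) == 1:
            if index != len(slices) - 1:
                raise ValueError("an open-ended slice is only allowed as the last slice")
            pieces.append(items[bounds[0]:])
        else:
            raise ValueError(f"invalid slice: {bounds!r}")
    return pieces


text = string.ascii_lowercase + string.digits  # 36 characters
print(apply_slices(text, [[0, 10], [-20]]))
# ['abcdefghij', 'qrstuvwxyz0123456789']

lines = ["line 0", "line 1", "line 2", "line 3"]
print(apply_slices(lines, [[1, 3]]))
# [['line 1', 'line 2']]
```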

Granularity

The granularity allows you to specify in which blocks of text we want to search.

The options are:

  1. full (default if nothing is specified)

  2. page

  3. sentence

  4. paragraph

  5. line
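
The following Python sketch illustrates what searching per block means. The separators are assumptions made for this example (a form feed between pages, a blank line between paragraphs, a naive regex for sentences); the predictor may split the text differently.

```python
import re


def split_blocks(text, granularity="full"):
    """Split text into the blocks a rule is matched against (assumed separators)."""
    if granularity == "full":
        blocks = [text]
    elif granularity == "page":
        blocks = text.split("\f")                  # assumption: form feed between pages
    elif granularity == "paragraph":
        blocks = re.split(r"\n\s*\n", text)        # assumption: blank line between paragraphs
    elif granularity == "line":
        blocks = text.splitlines()
    elif granularity == "sentence":
        blocks = re.split(r"(?<=[.!?])\s+", text)  # naive sentence splitter
    else:
        raise ValueError(f"unknown granularity: {granularity!r}")
    return [block for block in blocks if block.strip()]


text = "Invoice 42 is overdue. Please pay soon.\n\nKind regards,\nThe team"
pattern = r"Invoice.*pay"

# The regex spans two sentences, so it only matches with a coarser granularity.
print(any(re.search(pattern, b) for b in split_blocks(text, "full")))      # True
print(any(re.search(pattern, b) for b in split_blocks(text, "sentence")))  # False
```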

Other operators next to “+rule”

In some cases plain regexes, which already have and, or and not built in, can no longer accommodate your rules. For those cases you can combine different regexes with higher level and, or and not operators, which in turn contain low level operators such as +rule. To negate an operator, replace the + in front of it with a -.

This is the full list of operators, with the higher level operators first, followed by the low level operators:

  • +and: list of higher or lower level operators which should all match

  • -and: negated version of +and

  • +or: list of higher or lower level operators of which at least one should match

  • -or: negated version of +or


  • +rule: list of strings starting with L: or D:. When the rule is evaluated, they are joined into one regex

  • -rule: negated version of +rule

  • +lemma: list of strings which are contained in the lemmatised text (see Lemma below)

  • -lemma: negated version of +lemma

Let’s say we want to match emails coming from no-reply@contract.fit only if we don’t see a phone number in the body. Our rule would now look like this:

{
  "confidence": 97,
  "+and": [
    {
      "+rule": ["L:no-reply@contract.fit"],
      "where_to_search": {
        "search_in": ["email_from"],
        "limits": [[0,10], [-20]]}
    },
    {
      "-rule": ["L:\\+32\\d{9}"],
      "where_to_search": {
        "search_in": ["email_body"]
      }
    }
  ]
}
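
A recursive evaluator makes the semantics of these operators easier to see. The Python sketch below is an illustration under assumptions, not the predictor's implementation: the names to_regex and evaluate are hypothetical, the document is assumed to be a dictionary of text parts, and limits, granularity and +lemma are left out for brevity.

```python
import re


def to_regex(parts):
    """Join the "L:" strings of a rule into one regex (assumed behaviour)."""
    return "".join(part[2:] for part in parts if part.startswith("L:"))


def evaluate(node, document):
    """Evaluate one operator node against a dict of text parts."""
    # Parts to search: the node's own search_in, else every part of the document.
    search_in = node.get("where_to_search", {}).get("search_in") or list(document)
    text = "\n".join(document.get(part, "") for part in search_in)

    for key, value in node.items():
        if key[0] not in "+-":
            continue  # skip confidence and where_to_search
        operator = key[1:]
        if operator == "rule":
            matched = re.search(to_regex(value), text) is not None
        elif operator == "and":
            matched = all(evaluate(child, document) for child in value)
        elif operator == "or":
            matched = any(evaluate(child, document) for child in value)
        else:
            raise ValueError(f"unsupported operator: {key}")
        # A leading "-" negates the result of the operator.
        return matched if key[0] == "+" else not matched
    raise ValueError("node contains no operator")


rule = {
    "confidence": 97,
    "+and": [
        {"+rule": ["L:no-reply@contract.fit"],
         "where_to_search": {"search_in": ["email_from"]}},
        {"-rule": ["L:\\+32\\d{9}"],
         "where_to_search": {"search_in": ["email_body"]}},
    ],
}

without_phone = {"email_from": "no-reply@contract.fit", "email_body": "Your contract is ready."}
with_phone = {"email_from": "no-reply@contract.fit", "email_body": "Call us on +32470123456."}

print(evaluate(rule, without_phone))  # True
print(evaluate(rule, with_phone))     # False
```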

Lemma

This is a list of strings that should be contained in the block of text defined by the granularity. Here we don’t look at the original text, but at the lemmatised version of the text.
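
As an illustration of the idea, the Python sketch below matches lemmas against a lemmatised text. Everything in it is an assumption made for the example: the tiny lemma table stands in for a real lemmatiser, the helper names are hypothetical, and the sketch assumes that all strings in the list have to be present.

```python
import re

# Toy lemma table, for illustration only; a real lemmatiser covers the whole language.
TOY_LEMMAS = {"invoices": "invoice", "paid": "pay", "paying": "pay", "was": "be", "were": "be"}


def lemmatise(text):
    """Lower-case the text and map each word to its lemma (toy implementation)."""
    words = re.findall(r"[a-z]+", text.lower())
    return [TOY_LEMMAS.get(word, word) for word in words]


def lemma_matches(lemmas, text):
    """True if every requested lemma occurs in the lemmatised text (assumed semantics)."""
    found = set(lemmatise(text))
    return all(lemma in found for lemma in lemmas)


# Corresponds to a hypothetical rule such as {"confidence": 97, "+lemma": ["invoice", "pay"]}.
print(lemma_matches(["invoice", "pay"], "The invoices were paid yesterday."))  # True
print(lemma_matches(["invoice", "pay"], "The invoices are attached."))         # False
```

The first text matches even though neither "invoice" nor "pay" appears literally in it, which is the difference with a regex on the original text.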

TODO: add example

IMPORTANT: Regexes are quite a bit more efficient than the and/or operators. Try to express as much as possible in the regex itself.

Note that when you nest operators, the where_to_search is passed down to the lower levels. If a lower level defines its own where_to_search, that one is used instead.

This way you can:

  • Specify a granularity that applies to different and/or rules

  • Limit the search space for different and/or rules without having to define the where_to_search multiple times
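
The pass-down behaviour can be sketched as a small recursive walk over a rule. The function resolve_where_to_search is hypothetical, and the sketch assumes that a where_to_search on a lower level replaces the inherited one as a whole rather than being merged with it key by key.

```python
HIGH_LEVEL = ("+and", "-and", "+or", "-or")
LOW_LEVEL = ("+rule", "-rule", "+lemma", "-lemma")


def resolve_where_to_search(node, inherited=None):
    """Yield (operator, value, effective where_to_search) for each low level operator."""
    # A where_to_search on this level wins over the one passed down from above.
    effective = node.get("where_to_search", inherited)
    for key, value in node.items():
        if key in HIGH_LEVEL:
            for child in value:
                yield from resolve_where_to_search(child, effective)
        elif key in LOW_LEVEL:
            yield key, value, effective


rule = {
    "confidence": 97,
    "where_to_search": {"search_in": ["email_body"], "granularity": "sentence"},
    "+or": [
        {"+rule": ["L:invoice"]},                               # inherits from the top level
        {"+rule": ["L:factuur"],
         "where_to_search": {"search_in": ["email_subject"]}},  # uses its own
    ],
}

for operator, value, where in resolve_where_to_search(rule):
    print(operator, value, where)
```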

Variables

When writing rules you might find yourself adding the same regex to many rules. Variables let you avoid this duplication. Specify them in a variables dictionary on the same level as rules and refer to them in your rules with the prefix D:. Re-using the telephone number, our full dictionary would now look like this:

{
  "no_reply": {
    "variables": {
      "var1": ["L:\\+32\\d{9}"]
    },
    "rules": [
      {
        "confidence": 97,
        "+and": [
          {
            "+rule": ["L:no-reply@contract.fit"],
            "where_to_search": {
              "search_in": ["email_from"],
              "limits": [[0,10], [-20]]}
          },
          {
            "-rule": ["D:var1"],
            "where_to_search": {
              "search_in": ["email_body"]
            }
          }
        ]
      }
    ]
  }
}
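
How a D: reference could be expanded is sketched below in Python. The helper build_regex is hypothetical; it assumes that a variable holds a list of L: (or further D:) strings and that all pieces are joined into one regex, as described for +rule above.

```python
import re


def build_regex(parts, variables):
    """Expand "D:" references and join all "L:" strings into one regex."""
    pieces = []
    for part in parts:
        if part.startswith("L:"):
            pieces.append(part[2:])
        elif part.startswith("D:"):
            # Look up the variable and expand it recursively.
            pieces.append(build_regex(variables[part[2:]], variables))
        else:
            raise ValueError(f"rule strings must start with L: or D:, got {part!r}")
    return "".join(pieces)


variables = {"var1": ["L:\\+32\\d{9}"]}

pattern = build_regex(["L:tel: ", "D:var1"], variables)
print(pattern)                                         # tel: \+32\d{9}
print(bool(re.search(pattern, "tel: +32470123456")))   # True
```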

FAQ

How can I only look in the email subject for my regex?

By adding the option where_to_search::search_in to your rule. An example field would look like this:

"rules": [
  {
    "confidence": 97,
    "+rule": ["L:no-reply@contract.fit"],
    "where_to_search": {
      "search_in": ["email_subject"]
    }
  }
]

How can I test if my regex has the right format?

In the front-end you can go to a certain page and select the text view (see attached image). This way you can copy the text you are searching in. You can then test the regex you created on https://regex101.com/. PS: Don’t forget to set the flavor to Python on the left-hand side of the screen and to remove the double escaping that JSON requires (\\d in the settings becomes \d on the site).
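
If you prefer to check a regex locally, the Python sketch below shows where the double escaping comes from and how to test the unescaped pattern with the standard re module. The sample settings fragment and the test sentences are made up for this example.

```python
import json
import re

# The rule exactly as it is written in the predictor settings (JSON text).
settings_json = r'{"-rule": ["L:\\+32\\d{9}"]}'

# Parsing the JSON removes one level of escaping.
rule = json.loads(settings_json)
pattern = rule["-rule"][0][2:]  # drop the "L:" prefix
print(pattern)                  # \+32\d{9}  (the form to paste into regex101)

print(bool(re.search(pattern, "Call +32470123456 today")))   # True
print(bool(re.search(pattern, "Call 0470 12 34 56 today")))  # False
```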

How can I add a default option for the case that rules did not match anything?

By adding this option to the tag field and creating a new rule for it in the predictor settings. Note that its confidence should be slightly lower than the lowest confidence of the other rules, since the regex . matches any character and the rule therefore matches practically every document. An example rule would look as follows:

{
  "email_coming_from": {
    "other": {
      "rules": [
        {
          "confidence": 96,
          "+rule": [
            "L:."
          ]
        }
      ]
    }
  }
}