...

Code Block
languagejson
{
  "confidence": 97,
  "+rule": ["L:no-reply@contract.fit"],
  "where_to_search": {
    "search_in": ["email_from"],
    "limits": [[0,10], [-20]]
  }
}

Granularity

The granularity lets you specify the blocks of text in which to search.

The options are:

  1. full (default if nothing is specified)

  2. page

  3. sentence

  4. paragraph

  5. line
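To make the options concrete, here is a minimal Python sketch of how a text could be cut into blocks for each granularity. The function name `split_blocks` and all of the separators (form feed between pages, blank line between paragraphs, a naive punctuation-based sentence split) are assumptions made for illustration; the actual segmentation used by the product may differ.

```python
import re

def split_blocks(text: str, granularity: str = "full") -> list[str]:
    """Split `text` into the blocks a rule would be matched against (illustrative only)."""
    if granularity == "full":
        blocks = [text]
    elif granularity == "page":
        blocks = text.split("\f")                   # assumption: form feed separates pages
    elif granularity == "paragraph":
        blocks = re.split(r"\n\s*\n", text)         # assumption: blank line separates paragraphs
    elif granularity == "line":
        blocks = text.splitlines()
    elif granularity == "sentence":
        blocks = re.split(r"(?<=[.!?])\s+", text)   # naive split after ., ! or ?
    else:
        raise ValueError(f"unknown granularity: {granularity!r}")
    # Drop blocks that contain only whitespace.
    return [block for block in blocks if block.strip()]
```

With a finer granularity, a regex can only match inside one block at a time, so a pattern that spans two sentences would match with `full` but not with `sentence`.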

Other operators next to “+rule”

...

  • +and: list of higher- or lower-level operators, all of which must match

  • -and: negated version of +and

  • +or: list of higher- or lower-level operators, at least one of which must match

  • -or: negated version of +or

...

  • +rule: list of strings, each starting with L: or D:. When the rule is evaluated, the strings are appended into one regex

  • -rule: negated version of +rule

  • +lemma: list of strings which are contained in the lemma (add link here)

  • -lemma: negated version of +lemma

...

Code Block
languagejson
{
  "confidence": 97,
  "+and": [
    {
      "+rule": ["L:no-reply@contract.fit"],
      "where_to_search": {
        "search_in": ["email_from"],
        "limits": [[0,10], [-20]]
      }
    },
    {
      "-rule": ["L:\\+32\\d{9}"],
      "where_to_search": {
        "search_in": ["email_body"]
      }
    }
  ]
}
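As a rough illustration of how such a nested rule could be evaluated, the Python sketch below walks the operators recursively. The names `evaluate`, `build_regex` and the `fields` dictionary are hypothetical, and several simplifications are assumed: only L: entries are handled, the strings of a rule are joined by plain concatenation, and `limits`, `confidence`, lemmas and granularity are ignored. It is not the product's implementation.

```python
import re

def build_regex(strings):
    """Join the L: entries of a rule into one pattern (plain concatenation is assumed)."""
    parts = []
    for entry in strings:
        if not entry.startswith("L:"):
            raise ValueError(f"this sketch only handles L: entries, got {entry!r}")
        parts.append(entry[2:])
    return "".join(parts)

def evaluate(node, fields, search_in=None):
    """Return True if `node` matches; `fields` maps field names to their text."""
    # A where_to_search defined on this level replaces the inherited one.
    where = node.get("where_to_search")
    if where is not None:
        search_in = where.get("search_in")
    for key in ("+and", "-and", "+or", "-or", "+rule", "-rule"):
        if key not in node:
            continue
        sign, name = key[0], key[1:]
        if name == "rule":
            # Without search_in, this sketch searches every available field.
            names = search_in if search_in is not None else list(fields)
            text = "\n".join(fields.get(n, "") for n in names)
            hit = re.search(build_regex(node[key]), text) is not None
        elif name == "and":
            hit = all(evaluate(child, fields, search_in) for child in node[key])
        else:  # "or"
            hit = any(evaluate(child, fields, search_in) for child in node[key])
        # A leading "-" negates the result of the operator.
        return hit if sign == "+" else not hit
    raise ValueError("node contains no known operator")

RULE = {
    "confidence": 97,
    "+and": [
        {"+rule": ["L:no-reply@contract.fit"],
         "where_to_search": {"search_in": ["email_from"], "limits": [[0, 10], [-20]]}},
        {"-rule": ["L:\\+32\\d{9}"],
         "where_to_search": {"search_in": ["email_body"]}},
    ],
}
```

Under these assumptions, `evaluate(RULE, {"email_from": "no-reply@contract.fit", "email_body": "Thanks for your message."})` is true, while the same call with a body containing `+32123456789` is false because the `-rule` part then fails.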

Lemma

This is a list of strings that must be contained in the selected granularity. The match is done against the lemmatised version of the text, not the original text.
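The idea can be sketched in Python with a toy lemmatiser. Everything here is hypothetical: `TOY_LEMMAS` is a hand-written lookup table standing in for a real lemmatiser, and `lemma_contains` checks a single string, since this page does not say how several strings in one list are combined.

```python
# Hypothetical lookup table; a real system would use a proper lemmatiser.
TOY_LEMMAS = {"invoices": "invoice", "paid": "pay", "were": "be", "was": "be"}

def lemmatise(text: str) -> str:
    """Lower-case the text, strip basic punctuation and map each word to its lemma."""
    words = (word.strip(".,!?") for word in text.lower().split())
    return " ".join(TOY_LEMMAS.get(word, word) for word in words)

def lemma_contains(needle: str, block: str) -> bool:
    """Check whether `needle` occurs in the lemmatised version of `block`."""
    return needle in lemmatise(block)
```

With this table, "The invoices were paid." is lemmatised to "the invoice be pay", so the string "pay" is found although the original text only contains "paid".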

TODO: add example

Note

IMPORTANT: Regexes are considerably more efficient than the and/or operators. Express as much of your logic as possible in a regex.

Info

Note that when operators are nested, the where_to_search of the higher level is passed down to the lower levels. If a lower level defines its own where_to_search, that one is used instead.

This way you can:

  • Specify a granularity that applies to different and/or rules

  • Limit the search space for different and/or rules without having to define the where_to_search multiple times
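The inheritance described above can be sketched in Python as follows. The function name `resolve_where` is hypothetical, and the sketch assumes that a lower-level where_to_search replaces the inherited one as a whole rather than being merged with it key by key.

```python
def resolve_where(node, inherited=None):
    """Yield (leaf, effective where_to_search) pairs for every leaf below `node`."""
    # A where_to_search on this level wins; otherwise the inherited one is kept.
    where = node.get("where_to_search", inherited)
    for key in ("+and", "-and", "+or", "-or"):
        if key in node:
            for child in node[key]:
                yield from resolve_where(child, where)
            return
    # No and/or operator found: this node is a leaf such as +rule or -rule.
    yield node, where

rule = {
    "where_to_search": {"search_in": ["email_subject"]},
    "+or": [
        {"+rule": ["L:no-reply@contract.fit"]},
        {"+rule": ["L:\\+32\\d{9}"],
         "where_to_search": {"search_in": ["email_body"]}},
    ],
}

effective = [where for _, where in resolve_where(rule)]
```

Here the first rule has no where_to_search of its own and therefore searches the email subject, while the second keeps its own setting and searches the email body.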

...

Variables

When writing rules you may find yourself repeating the same regex in many rules. Variables avoid this duplication: define them once in a dictionary at the highest level and reference them in your rules with the prefix D:. Re-using the telephone number regex, our full dictionary would now look like this:

Code Block
languagejson
{
  "no_reply": {
    "variables": {
      "var1": ["L:\\+32\\d{9}"]
    },
    "rules": [
      {
        "confidence": 97,
        "+and": [
          {
            "+rule": ["L:no-reply@contract.fit"],
            "where_to_search": {
              "search_in": ["email_from"],
              "limits": [[0,10], [-20]]
            }
          },
          {
            "-rule": ["D:var1"],
            "where_to_search": {
              "search_in": ["email_body"]
            }
          }
        ]
      }
    ]
  }
}
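A minimal Python sketch of how the D: references could be resolved before the regex is built. The function name `expand_variables` is hypothetical, and the sketch assumes a single, non-recursive substitution in which each D: entry is replaced by all strings of the variable it names.

```python
def expand_variables(strings, variables):
    """Replace every D:<name> entry by the strings stored under <name>."""
    expanded = []
    for entry in strings:
        if entry.startswith("D:"):
            # Raises KeyError if the variable was never defined.
            expanded.extend(variables[entry[2:]])
        else:
            expanded.append(entry)
    return expanded

variables = {"var1": ["L:\\+32\\d{9}"]}
resolved = expand_variables(["D:var1"], variables)
```

After expansion, the rule `["D:var1"]` is equivalent to `["L:\\+32\\d{9}"]`, so changing the telephone regex in one place updates every rule that references it.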

FAQ

How can I only look in the email subject for my regex?

Add the where_to_search option with a search_in list to your rule. An example would look like this:

Code Block
languagejson
"rules": [
  {
    "confidence": 97,
    "+rule": ["L:no-reply@contract.fit"],
    "where_to_search": {
      "search_in": ["email_subject"]
    }
  }
]

How can I test if my regex has the right format?

In the front-end, open a page and select the text view (see attached image). This way you can copy the text you are searching in. You can then test your regex on https://regex101.com/. Don’t forget to set the language to Python on the left-hand side of the screen and to remove the double escaping (for example, write \d instead of \\d).
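The same check can be done locally in Python. The sketch below decodes a rule string exactly as it is written in the JSON settings, which removes the double escaping, and then tries the regex on a sample text. The helper name `pattern_from_json` is hypothetical, and stripping the L: prefix before compiling is an assumption about how the rule is interpreted.

```python
import json
import re

def pattern_from_json(json_text: str) -> str:
    """Decode a JSON list with one L: entry and return the bare regex."""
    entry = json.loads(json_text)[0]   # JSON decoding turns \\ into a single backslash
    if not entry.startswith("L:"):
        raise ValueError("expected an entry starting with L:")
    return entry[len("L:"):]           # assumption: the L: prefix is not part of the regex

# Raw string, so the text is exactly what appears in the settings file.
pattern = pattern_from_json(r'["L:\\+32\\d{9}"]')
match = re.search(pattern, "Call us on +32123456789")
```

The value of `pattern` is the single-escaped regex `\+32\d{9}`, which is the form to paste into regex101.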

How can I add a default option for the case that rules did not match anything?

Add the option to the tag field and create a new rule for it in the predictor settings. Give this rule a confidence slightly lower than the lowest confidence of the other rules, since the regex . matches anything. An example rule would look as follows:

Code Block
languagejson
{
  "email_coming_from": {
    "other": {
      "rules": [
        {
          "confidence": 96,
          "+rule": [
            "L:."
          ]
        }
      ]
    }
  }
}
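Why the lower confidence matters can be illustrated with a small Python sketch. It assumes that, among all options whose rule matches, the one with the highest confidence is chosen; that selection logic and the function name `pick_option` are assumptions for illustration, not a description of the actual predictor.

```python
import re

def pick_option(options, text):
    """Return (name, confidence) of the best matching option, or None if nothing matches."""
    best = None
    for name, confidence, pattern in options:
        if re.search(pattern, text) and (best is None or confidence > best[1]):
            best = (name, confidence)
    return best

options = [
    ("no_reply", 97, "no-reply@contract.fit"),
    ("other", 96, "."),   # catch-all: matches any text with at least one character
]
```

With these options, a sender of `no-reply@contract.fit` matches both rules and is labelled `no_reply` because 97 is higher than 96, while any other sender falls through to `other`. Note that in Python the regex `.` does not match an empty string, so a completely empty text would still match nothing.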