...
In the example below we describe the implementation for the field email_coming_from with the tag options no_reply and info, so they will match emails sent from noreply@contract.fit and info@contract.fit respectively.
...
Code Block | ||
---|---|---|
| ||
{ "key_value_pairs": { "rule_config": { "email_coming_from": { "no_reply": { "rules": [ { "confidence": 97, "+rule": ["L:noreply@contract.fit"] } ] }, "info": { "rules": [ { "confidence": 97, "+rule": ["L:info@contract.fit"] } ] } } } } } |
Per tag option (the level of no_reply and info) you can specify a list of rules, each with its own confidence. The matching rule with the highest confidence will be presented in the prediction. If we see noreply@contract.fit somewhere in the uploaded document, a prediction with confidence 97 for the tag option no_reply will be returned.
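The selection of the highest-confidence rule can be sketched as follows. This is a minimal illustration, not the actual implementation: the function name best_confidences is hypothetical, the second rule with confidence 60 is invented for the example, and we assume that the text after the "L:" prefix is treated as a regular expression.

```python
import re

def best_confidences(tag_options, text):
    """Map each tag option to the confidence of its best matching rule."""
    predictions = {}
    for tag_option, config in tag_options.items():
        matched = [
            rule["confidence"]
            for rule in config["rules"]
            # Assumption: the text after the "L:" prefix is a regular expression.
            if any(re.search(pattern[2:], text) for pattern in rule["+rule"])
        ]
        if matched:
            predictions[tag_option] = max(matched)
    return predictions

tag_options = {
    "no_reply": {"rules": [
        {"confidence": 97, "+rule": ["L:noreply@contract.fit"]},
        {"confidence": 60, "+rule": ["L:noreply"]},  # hypothetical extra rule
    ]},
    "info": {"rules": [
        {"confidence": 97, "+rule": ["L:info@contract.fit"]},
    ]},
}
predictions = best_confidences(tag_options, "From: noreply@contract.fit")
```

Both no_reply rules match here, and only the higher confidence is kept for that tag option.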
...
This function has 4 options, all of which are optional:
preprocess_text: A boolean specifying whether the text needs to be cleaned (= removal of noise such as unexpected characters, e.g. multiple dashes).
search_in (for email): Specifies which parts of the email we look in. There are 5 options, which can be combined. When left empty, we look in all of them:
email_from
email_to
email_subject
email_body
attachment
limits: This field specifies which limits we apply to our search_space (more info here).
Granularity
Search_in
In the example given above (“is the sender noreply@contract.fit or info@contract.fit?”) our rules are applied to the entire document. This is too broad for this use case. Instead, we would like to limit our search to the email field which specifies the sender, the email_from field. We can do this by adding the following:
Code Block | ||
---|---|---|
| ||
{ "confidence": 97, "+rule": ["L:noreply@contract.fit"], "where_to_search": {"search_in": ["email_from"]} } |
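To illustrate what search_in does to the search space, here is a minimal sketch. It assumes an email is represented as a dictionary keyed by the five part names listed earlier; the helper build_search_space and this dictionary layout are hypothetical and only serve the example.

```python
# Hypothetical representation: one string per email part.
EMAIL_PARTS = ["email_from", "email_to", "email_subject", "email_body", "attachment"]

def build_search_space(email, search_in=None):
    """Return only the email parts a rule should look in."""
    # An empty or missing search_in means: look in all parts.
    parts = search_in if search_in else EMAIL_PARTS
    return {part: email.get(part, "") for part in parts}

email = {
    "email_from": "noreply@contract.fit",
    "email_body": "Please contact info@contract.fit for questions.",
}
sender_only = build_search_space(email, ["email_from"])
```

With search_in set to email_from, the info@contract.fit address in the body is no longer part of the text the rule can match.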
...
These are the 5 options for limits:
document_types: list of document types, these can be combined.
pages: list of slices to specify which pages you want to search in.
lines: list of slices to specify which lines you want to search in.
characters: list of slices to specify which characters you want to search in.
email_chains: dictionary of an integer as key (specifying what chain to look in) and a list of slices as value.
For options 2, 3, 4 and 5 we use Python slice semantics. Each slice requires a start and a stop value. Start and stop can be integers, to define an index, or floats, to define a percentage of the total length. Start and stop need to have the same type.
...
Code Block | ||
---|---|---|
| ||
{ "confidence": 97, "+rule": ["L:noreply@contract.fit"], "where_to_search": { "search_in": ["email_from"], "limits": { "characters": [[0,10], [-10]] } } } |
...
Code Block | ||
---|---|---|
| ||
{ "confidence": 97, "+rule": ["L:noreply@contract.fit"], "where_to_search": { "search_in": ["email_from"], "limits": { "characters": [[0., 0.2]] } } } |
If email_from contains the letters of the alphabet, “abcdefghijklmnopqrstuvwxyz”, we would now search in “abcdef”.
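The integer and float slices can be sketched in Python as follows. This is only an illustration under assumptions: the helper apply_slice is hypothetical, and the rounding of float bounds (start rounded down, stop rounded up) is chosen so that the alphabet example above yields “abcdef”. The actual rounding may differ.

```python
import math

def apply_slice(text, bounds):
    """Return the part of `text` selected by one [start, stop] pair."""
    start, stop = bounds
    if type(start) is not type(stop):
        raise ValueError("start and stop must have the same type")
    if isinstance(start, float):
        # Assumption: floats scale by the text length;
        # start is rounded down and stop is rounded up.
        start = math.floor(start * len(text))
        stop = math.ceil(stop * len(text))
    # Integers are used as plain Python slice indices.
    return text[start:stop]

alphabet = "abcdefghijklmnopqrstuvwxyz"
first_ten = apply_slice(alphabet, [0, 10])
first_fifth = apply_slice(alphabet, [0., 0.2])
```

Mixing an integer with a float, for example [0, 0.2], raises an error in this sketch, which mirrors the requirement that start and stop have the same type.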
Example 3: Imagine that in our example we have a chain of emails sent back and forth, with the first email being forwarded from noreply@contract.fit.
If we don’t specify any limits on our email chains, the rule will look in all emails and find an email coming from noreply@contract.fit. However, we only care who sent the first email. So we specify a limit on the email chains which restricts the search space to the first email in the first chain of emails.
Code Block | ||
---|---|---|
| ||
{
"confidence": 97,
"+rule": ["L:noreply@contract.fit"],
"where_to_search": {
"search_in": ["email_from"],
"limits": {
"email_chains": {"0": [[0, 1]]}
}
}
} |
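The effect of the email_chains limit above can be sketched as follows. We assume, purely for illustration, that the chains are available as a list of lists of sender addresses in which index 0 is the first email of a chain. The helper select_emails and the example addresses are hypothetical.

```python
def select_emails(chains, email_chain_limits):
    """Keep only the emails selected by the per-chain slices."""
    selected = []
    for chain_key, slices in email_chain_limits.items():
        # Keys are strings in the JSON config, so convert them to an index.
        chain = chains[int(chain_key)]
        for start, stop in slices:
            selected.extend(chain[start:stop])
    return selected

# Hypothetical data: one chain, first email sent by noreply@contract.fit.
chains = [["noreply@contract.fit", "alice@example.com", "bob@example.com"]]
first_sender = select_emails(chains, {"0": [[0, 1]]})
```

Only the first email of the first chain remains in the search space, so later replies in the chain can no longer trigger the rule.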
The granularity allows you to specify in which blocks of text we want to search.
The options are:
full (default if nothing is specified)
page
sentence
paragraph
line
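As an illustration of what these granularities mean, the sketch below splits a text into blocks. The splitting logic is an assumption made for the example (blank lines separate paragraphs, a form feed separates pages, sentences end with ".", "!" or "?"). The helper split_into_blocks is hypothetical and the real splitting may work differently.

```python
import re

def split_into_blocks(text, granularity="full"):
    """Split `text` into the blocks a rule is matched against."""
    if granularity == "full":
        return [text]
    if granularity == "page":
        # Assumption: pages are separated by a form feed character.
        return text.split("\f")
    if granularity == "paragraph":
        return [b for b in re.split(r"\n\s*\n", text) if b.strip()]
    if granularity == "line":
        return [line for line in text.splitlines() if line.strip()]
    if granularity == "sentence":
        # Naive sentence split on whitespace after ".", "!" or "?".
        return [s for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    raise ValueError(f"unknown granularity: {granularity}")

text = "Dear team.\nSee attachment.\n\nKind regards"
paragraphs = split_into_blocks(text, "paragraph")
lines = split_into_blocks(text, "line")
```

A rule that needs two words to appear together would then be checked inside each block separately instead of across the whole document.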
Logical additions to “+rule”
...
Let’s say we want to match the emails coming from noreply@contract.fit if and only if we don’t see a phone number in the body. Our rule would now look like this:
Code Block | ||
---|---|---|
| ||
{ "confidence": 97, "+and": [ { "+rule": ["L:noreply@contract.fit"], "where_to_search": { "search_in": ["email_from"], "limits": [[0,10], [-20]]} }, { "-rule": ["L:\\+32\\d{9}"], "where_to_search": { "search_in": ["email_body"] } } ] } |
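The logic of combining a “+rule” and a “-rule” under “+and” can be sketched as follows. This is a simplified illustration: limits are ignored, the email is assumed to be a dictionary of parts, the function names are hypothetical, and we assume that the text after the "L:" prefix is a regular expression.

```python
import re

def any_pattern_matches(patterns, text):
    # Assumption: the text after the "L:" prefix is a regular expression.
    return any(re.search(pattern[2:], text) for pattern in patterns)

def and_clause_matches(clauses, email):
    """True when every "+rule" clause matches and no "-rule" clause matches."""
    for clause in clauses:
        parts = clause["where_to_search"]["search_in"]
        text = "\n".join(email.get(part, "") for part in parts)
        if "+rule" in clause and not any_pattern_matches(clause["+rule"], text):
            return False
        if "-rule" in clause and any_pattern_matches(clause["-rule"], text):
            return False
    return True

clauses = [
    {"+rule": ["L:noreply@contract.fit"],
     "where_to_search": {"search_in": ["email_from"]}},
    {"-rule": [r"L:\+32\d{9}"],
     "where_to_search": {"search_in": ["email_body"]}},
]
without_phone = {"email_from": "noreply@contract.fit", "email_body": "Hello"}
with_phone = {"email_from": "noreply@contract.fit", "email_body": "Call +32123456789"}
```

In this sketch the email without a phone number matches, while the one containing +32123456789 in the body does not.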
...
Code Block | ||
---|---|---|
| ||
{ "no_reply": { "variables": { "var1": ["L:\\+32\\d{9}"] }, "rules": [ { "confidence": 97, "+and": [ { "+rule": ["L:noreply@contract.fit"], "where_to_search": { "search_in": ["email_from"], "limits": [[0,10], [-20]]} }, { "-rule": ["D:var1"], "where_to_search": { "search_in": ["email_body"] } } ] } ] } } |
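Judging from the example above, a "D:var1" entry refers to the patterns stored under var1 in the variables block. A minimal sketch of such a substitution, assuming exactly that behaviour (the helper expand_variables is hypothetical):

```python
def expand_variables(patterns, variables):
    """Replace each "D:<name>" entry by the patterns stored under <name>."""
    expanded = []
    for pattern in patterns:
        if pattern.startswith("D:"):
            # Assumption: "D:" references a key of the "variables" block.
            expanded.extend(variables[pattern[2:]])
        else:
            expanded.append(pattern)
    return expanded

variables = {"var1": [r"L:\+32\d{9}"]}
expanded = expand_variables(["D:var1"], variables)
```

This keeps a pattern such as the phone number regex in one place, so that several rules can reuse it.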
Lemma
Lemmatisation is the process of reducing the different inflected forms of a word to a single base form, the lemma (for example, “sent” and “sending” both become “send”). More information can be found here: https://en.wikipedia.org/wiki/Lemmatisation. This is a very powerful way to improve the matching capabilities of regexes. A lemma rule is a list of strings that are contained in the granularity. Here we don’t look at the original text, but at the lemmatised version of the text.
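The idea can be illustrated with a toy example. The lookup table below stands in for a real lemmatiser and is invented for this sketch, and we assume here that all listed strings must be present in the block. The function names are hypothetical.

```python
# Toy lookup table standing in for a real lemmatiser (hypothetical data).
LEMMAS = {"sent": "send", "sends": "send", "sending": "send", "invoices": "invoice"}

def lemmatise(text):
    """Lowercase the text and map every word to its lemma if known."""
    return [LEMMAS.get(word, word) for word in text.lower().split()]

def lemma_rule_matches(required_lemmas, block):
    # Assumption: every listed string must occur in the lemmatised block.
    lemmas = set(lemmatise(block))
    return all(lemma in lemmas for lemma in required_lemmas)

block = "She sent two invoices"
matches = lemma_rule_matches(["send", "invoice"], block)
```

The rule matches even though neither “send” nor “invoice” appears literally in the original text.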
...
Expand | |||||
---|---|---|---|---|---|
| |||||
By adding the option where_to_search::search_in to your rule. An example field would look like this:
|
...