...
preprocess_text: A boolean specifying whether the text needs to be cleaned, i.e. whether noise such as unexpected characters (e.g. multiple dashes) is removed.
search_in (for email): This field specifies which parts of the email we look in. There are 5 options, which can be combined. When left empty, we look in all of them:
email_from
email_to
email_subject
email_body
attachment
limits: This field specifies which limits we apply to our search_space (add link to further here).
granularity: This field specifies what the granularity for a match should be (add link to further here).
In our example, we only want to look in email_from, so the rule will look like this:
Code Block | ||
---|---|---|
| ||
{ "confidence": 97, "+rule": ["L:no-reply@contract.fit"], "where_to_search": {"search_in": ["email_from"]} } |
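As a rough illustration, the rule above could be assembled and checked in Python before it is submitted. This is only a sketch: the helper `build_rule` and the constant `EMAIL_SEARCH_OPTIONS` are hypothetical names introduced here, not part of the product. The only facts taken from this page are the five `search_in` options and the shape of the rule.

```python
import json

# The five search_in options listed above for emails.
EMAIL_SEARCH_OPTIONS = {
    "email_from", "email_to", "email_subject", "email_body", "attachment",
}


def build_rule(pattern, confidence, search_in=()):
    """Hypothetical helper: build a rule dict shaped like the example above."""
    unknown = set(search_in) - EMAIL_SEARCH_OPTIONS
    if unknown:
        raise ValueError(f"unknown search_in option(s): {sorted(unknown)}")
    rule = {"confidence": confidence, "+rule": [pattern]}
    # An empty search_in is left out, which means "look in all options".
    if search_in:
        rule["where_to_search"] = {"search_in": list(search_in)}
    return rule


rule = build_rule("L:no-reply@contract.fit", 97, ["email_from"])
print(json.dumps(rule, indent=2))
```

Validating the option names locally catches typos such as `email_form` before the rule is stored.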
...
You can apply different limits to restrict the search space of your query. For this we use the notion of a Python slice.
These are the 4 options in limits:
document_types: list of document types which can be combined.
pages: list of slices to specify which pages you want to search in.
lines: list of slices to specify which lines you want to search in.
characters: list of slices to specify which characters you want to search in.
To specify the part of the full object you want, you need to specify a list of slices. The syntax of a slice is as follows:
Code Block |
---|
[start, stop] # items from start through stop-1 |
[-start, -stop] # items from start (counting from end) through stop-1 (counting from end) |
[start] # items from start through end (only allowed for the last slice) |
[-start] # items from start (counting from end) through end (only allowed for the last slice) |
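The slice syntax above maps closely onto Python's built-in `slice` objects. The following is a minimal sketch of that mapping, assuming 0-based indexing and standard Python semantics for negative indices. The function names `to_slice` and `to_slices` are hypothetical, and the actual implementation may differ.

```python
def to_slice(spec, is_last):
    """Turn one [start, stop] or [start] entry into a Python slice object."""
    if len(spec) == 2:
        return slice(spec[0], spec[1])
    if len(spec) == 1:
        # The open-ended form is only allowed for the last slice in the list.
        if not is_last:
            raise ValueError("an open-ended slice is only allowed as the last entry")
        return slice(spec[0], None)
    raise ValueError(f"expected [start] or [start, stop], got {list(spec)!r}")


def to_slices(specs):
    """Convert a list of slice specs, enforcing the 'last slice only' rule."""
    last = len(specs) - 1
    return [to_slice(spec, index == last) for index, spec in enumerate(specs)]


# Made-up page labels, used only to show which items each slice selects.
pages = ["p0", "p1", "p2", "p3", "p4", "p5"]
selected = [pages[s] for s in to_slices([[0, 2], [-2]])]
print(selected)  # [['p0', 'p1'], ['p4', 'p5']]
```

Here `[0, 2]` selects the first two pages and `[-2]` selects everything from the second-to-last page through the end.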
These are the 5 options in limits:
...
...
pages: list of slices to specify which pages you want to search in.
...
lines: list of slices to specify which lines you want to search in.
...
characters: list of slices to specify which characters you want to search in.
If we go back to our example and only want to look in the first 10 and last 20 characters of email_from, our rule would now look like this:
Code Block | ||
---|---|---|
| ||
{ "confidence": 97, "+rule": ["L:no-reply@contract.fit"], "where_to_search": { "search_in": ["email_from"], "limits": { "characters": [[0,10], [-20]] } } } |
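To make the effect of the `characters` limit concrete, the sketch below applies the two slices from the rule to a sample sender string. The sample value and the helper `character_windows` are assumptions made for illustration. Whether the engine searches each window separately or joins them first is not specified on this page, so the sketch only shows which characters each slice covers.

```python
rule = {
    "confidence": 97,
    "+rule": ["L:no-reply@contract.fit"],
    "where_to_search": {
        "search_in": ["email_from"],
        "limits": {"characters": [[0, 10], [-20]]},
    },
}


def character_windows(text, specs):
    """Hypothetical helper: return the substring covered by each slice spec."""
    windows = []
    for spec in specs:
        start = spec[0]
        stop = spec[1] if len(spec) == 2 else None  # [start] runs to the end
        windows.append(text[start:stop])
    return windows


# Sample sender value, made up for this illustration.
email_from = "no-reply@contract.fit"
specs = rule["where_to_search"]["limits"]["characters"]
windows = character_windows(email_from, specs)
print(windows)  # ['no-reply@c', 'o-reply@contract.fit']
```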
Granularity
The granularity allows you to specify in which blocks of text we want to search.
...