Configuring predictor settings
Have you noticed that the majority of your files come from the same supplier? Do you know up front which values to expect for a specific field? Would you like to set a default prediction for one of the fields in the format? Are there any values that you want to block from being predicted?
If you answered yes to any of these questions, you might be interested in our predictor settings feature. Predictor settings let you give the model hints per field through the parameters fallback, expected_values, whitelist, and blacklist.
fallback: values that you would like to set as default prediction if the model did not find any other prediction candidate
expected_values: values that you would like the model to look into first, as you expect that they might be the correct ones
whitelist: values that you allow the model to return
blacklist: values that you do not want the model to predict
table_extraction_settings: line item specific settings
field_settings: field specific settings, indexed by field name
expected_pattern: a regex pattern to look for, in case a table cell contains more than what you need to extract. Also helpful if you have uncommon patterns for typical data types, for example a custom date format, an unusual amount format, or a custom unit.
override_default_patterns: indicates whether the expected patterns should take precedence over what the model predicted
nested_field_patterns: regex patterns, indexed by field name in case fields are nested inside each other
default_value: a default string value to extract for this field, in case nothing was found on a row
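To picture how expected_pattern and default_value relate, here is a minimal Python sketch. It is only an illustration of the behaviour described above, not the service's implementation: the function name extract_cell_value is hypothetical, and the service's regex engine may differ from Python's re module.

```python
import re


def extract_cell_value(cell_text, expected_pattern, default_value=None):
    """Return the first part of cell_text matching expected_pattern.

    Falls back to default_value when nothing in the cell matches.
    Hypothetical helper, for illustration only.
    """
    match = re.search(expected_pattern, cell_text)
    return match.group(0) if match else default_value


# A custom date format (DD.MM.YYYY) buried in a longer table cell.
date_pattern = r"\d{2}\.\d{2}\.\d{4}"

found = extract_cell_value("Delivered on 03.11.2023 (express)", date_pattern)
missing = extract_cell_value("No date given", date_pattern, default_value="unknown")
```

In the first call the cell contains more than the date, and only the matching part is kept. In the second call nothing matches, so the default value is returned.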
Payload example:
{
  "fallback": {},
  "expected_values": {},
  "whitelist": {},
  "blacklist": {},
  "table_extraction_settings": {
    "field_settings": {
      "field_name_1": {
        "expected_pattern": "string",
        "override_default_patterns": true,
        "default_value": "string",
        "nested_field_patterns": {
          "nested_field_name_1": "custom_regex_pattern_1",
          "nested_field_name_2": "custom_regex_pattern_1"
        }
      },
      "field_name_2": {
        "expected_pattern": "string",
        "override_default_patterns": true,
        "default_value": "string"
      }
    }
  }
}
Adding predictor settings for header fields
Let’s say, for example, that you have the following information:
You process only invoices, and 70% of your invoices come from Supplier X and Supplier Y combined, whose VAT numbers you know: BE0123456789 and BE0987654321. Put them in expected_values to increase recall (true positives).
You mostly receive invoices in euros and have noticed that the currency field is often left empty. Put EUR in fallback to decrease false negatives.
You sometimes get your own company's VAT number (BE0111222333) as the predicted "sender_VAT", which is incorrect. Put it in blacklist to decrease false positives.
You would then send the following payload to PATCH /predictor_settings/{scope}, where {scope} is the inbox UUID or project UUID to which these predictor settings should apply:
{
  "fallback": {
    "currency": "EUR"
  },
  "expected_values": {
    "sender_VAT": ["BE0123456789", "BE0987654321"]
  },
  "whitelist": {},
  "blacklist": {
    "sender_VAT": "BE0111222333"
  }
}
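As a rough sketch, the request above could be built in Python as follows. The base URL, the token, the bearer-style Authorization header, and the scope UUID are all placeholders and assumptions; only the PATCH method, the /predictor_settings/{scope} path, and the payload come from this page. Check your API reference for the real host and authentication scheme.

```python
import json
import urllib.request

# Placeholder values (assumptions): replace with your real ones.
BASE_URL = "https://api.example.com"
API_TOKEN = "your-api-token"
SCOPE = "00000000-0000-0000-0000-000000000000"  # inbox UUID or project UUID

payload = {
    "fallback": {"currency": "EUR"},
    "expected_values": {"sender_VAT": ["BE0123456789", "BE0987654321"]},
    "whitelist": {},
    "blacklist": {"sender_VAT": "BE0111222333"},
}


def build_patch_request(base_url, scope, body, token):
    """Build (but do not send) a PATCH request for the given scope."""
    return urllib.request.Request(
        f"{base_url}/predictor_settings/{scope}",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            # Assumed auth scheme; your API may use a different header.
            "Authorization": f"Bearer {token}",
        },
        method="PATCH",
    )


request = build_patch_request(BASE_URL, SCOPE, payload, API_TOKEN)
# To actually send it, uncomment the next line:
# response = urllib.request.urlopen(request)
```

The request is only constructed here, so the sketch can be run without network access; sending it is a single urlopen call.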
Adding predictor settings for line items
The same logic applies to line items. Consider the following example:
You process invoices with tables. In the majority of your tables, you notice that the description contains the package size of the shipped item. You know that it is always a volume that looks like this: 20x30x40cm. You can add a matching pattern to nested_field_patterns so the model knows that the volume can be found inside the description.
{
  "table_extraction_settings": {
    "field_settings": {
      "description": {
        "nested_field_patterns": {
          "volume": "\\d+x\\d+x\\d+cm"
        }
      }
    }
  }
}
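Before sending a pattern like this, it can help to try it locally. The sketch below parses the payload above and applies the volume pattern to a sample description with Python's re module. The sample description is made up, and the service's regex engine is assumed (not confirmed) to treat this simple pattern the same way Python does. Note that the doubled backslashes in the JSON text decode to single backslashes in the actual pattern.

```python
import json
import re

# The payload exactly as it appears in JSON (raw string keeps the backslashes).
payload_text = r"""
{
  "table_extraction_settings": {
    "field_settings": {
      "description": {
        "nested_field_patterns": {
          "volume": "\\d+x\\d+x\\d+cm"
        }
      }
    }
  }
}
"""

settings = json.loads(payload_text)
volume_pattern = (
    settings["table_extraction_settings"]["field_settings"]
    ["description"]["nested_field_patterns"]["volume"]
)

# Made-up description cell for illustration.
description = "Cardboard box 20x30x40cm, fragile"
match = re.search(volume_pattern, description)
volume = match.group(0) if match else None
```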
Overwriting:
Within a field's settings, you can leave out parameters, for example sending “expected_pattern” and “default_value” but not “override_default_patterns” and “nested_field_patterns”. However, any parameter you leave out is reset to empty, which overwrites its existing value.
Top-level sections behave differently: if “fallback” values are already set and you send a payload that contains only table extraction settings, the existing fallback values are not erased.
To be safe, treat PATCH as if it completely replaces everything. If you have existing predictor settings and only want to add some rules, first GET the current settings, then modify what you want to change, and PATCH the complete result.
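The "modify" step of that GET, modify, PATCH cycle can be sketched as a recursive merge of your changes into the settings you fetched. This is a minimal sketch: merge_settings is a hypothetical helper, and the existing dictionary stands in for whatever your GET request returns.

```python
import copy


def merge_settings(existing, changes):
    """Return a new dict with changes merged recursively into existing.

    Nested dicts are merged key by key; any other value in changes
    replaces the existing one. Neither input is modified.
    """
    merged = copy.deepcopy(existing)
    for key, value in changes.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_settings(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


# Stand-in for the settings returned by GET.
existing = {
    "fallback": {"currency": "EUR"},
    "table_extraction_settings": {
        "field_settings": {
            "description": {"expected_pattern": "\\d+x\\d+x\\d+cm"}
        }
    },
}

# The rule you want to add.
changes = {
    "table_extraction_settings": {
        "field_settings": {"description": {"default_value": "n/a"}}
    }
}

full_payload = merge_settings(existing, changes)
```

Sending full_payload rather than changes alone keeps the existing expected_pattern for the description field instead of resetting it to empty.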
Notes:
Scope: If the scope is a project, the predictor settings are applied to all inboxes inside that project
Shortcomings:
Pattern efficiency: If the regex is too complex, it can cause a timeout for the extraction of all line items
Pattern correctness: Pattern syntax has to be validated before you send it. You can easily check it with regex101 (build, test, and debug regex). It is not yet confirmed whether the endpoint has a safety net for invalid patterns; ideally it would reject them with an error.
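As a complement to regex101, a quick local syntax check can catch malformed patterns before they are sent. The sketch below uses Python's re module; check_pattern is a hypothetical helper, and since the service's regex dialect is not documented here, a pattern that compiles in Python is not guaranteed to behave identically on the server. It also does not measure how expensive a pattern is to run.

```python
import re


def check_pattern(pattern):
    """Return (True, None) if pattern compiles, else (False, error message)."""
    try:
        re.compile(pattern)
    except re.error as exc:
        return False, str(exc)
    return True, None


valid_ok, valid_error = check_pattern(r"\d+x\d+x\d+cm")

# Unbalanced parenthesis: a typical syntax mistake.
broken_ok, broken_error = check_pattern(r"(\d+x\d+")
```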