A format defines what you want to extract from a given type of document. It defines the structure of the output and is directly visible in the Data View of the Data Entry Companion.
A format is a structured collection of fields of interest and consists of:
Text fields: Pieces of information that appear literally in the document
Tag fields: Information that does not appear literally in the document, but that can be derived. Tags come from a finite set of options.
Rows: Text and Tag fields that logically belong together
A format consists of at least one table: the Format table.
You can add as many additional tables as needed for information that follows a row/table logic.
Format Table
...
...
Edit Formats
Add new format: Click on the round blue button in the top right corner. Type the name of your new format in the pop-up box and click confirm. This automatically adds a new entry to the formats table.
Remove format: Click on the red delete button beside the format that you want to remove.
Edit format: Click on the blue edit button beside the format that you want to edit.
...
...
Format Table
...
These fields have a number of properties.
...
Index | Property | Description | Default |
1 | Mandatory | You can indicate whether or not a field is required before a file can be submitted. Mandatory fields will be marked in red when not found and will block submission if not filled. | false |
2 | Field | Enter the technical name of your field here. These names have to be identical (including case) to the field names that we support out-of-the-box. If we do not support the field yet, no predictions will be made. | string |
3 | Annotation, Tag | | annotation |
4 | Data type | Indicates in what format the field should be saved. | string |
5 | Scope | Indicates whether the scope is a page, a section or a document. | section |
6 | Visible | Indicates whether a field is visible in the review page. | true |
7 | Multiple | Indicates whether a field of interest can appear multiple times in the scope. If this is disabled, the machine will predict only one value for the field. | false |
8 | Count in evaluation | A flag that decides whether the field is evaluated in the statistics. Setting this property to false excludes the field from evaluations. This is useful for commentary and other optional fields. | true |
9 | Conditional | Some fields of interest are only relevant depending on the value of other fields of interest. For example, you may only be interested in the VAT number of an invoice if it has been confirmed that the document is a valid invoice (valid_invoice field == true). | none |
10 | Display name | This is the label shown in the front end to the user of the Data Entry Companion and in the stats pane. As opposed to the Field (2), which is the technical field name, the display name can be any name that you find more intuitive than the technical names of our out-of-the-box fields. | same as value in Field (2) |
11 | Technical name | This is the label that will be used when communicating to servers. | Same as display name |
Property | Description | Default |
Active | A toggle to turn the search for this field on or off. | on |
System name | An immutable UUID that serves as a reference to the underlying machine learning model. | UUID assigned by the system |
Type | Indicates whether this is a text field or a tag field. | text field |
Options - text field | For a text field, the number of options is virtually unlimited. You can specify a data type other than string to reduce the number of options for a text field. For example, here you can specify that the text should be an amount. | string |
Options - tag field | A tag field can only take values from a finite set of values. Here, you can exhaustively list the options for that tag field. | Empty list |
Formula - computed field | A formula that allows you to add Excel-like business logic to our solution. | |
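To make the properties above more concrete, the sketch below models a field as a small record and shows how a mandatory check could block submission. It is illustrative only: the class `FieldDefinition`, its attribute names, the helper `missing_mandatory` and the example field names are hypothetical and not part of the product. Only the property names and default values are taken from the table.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class FieldDefinition:
    """Hypothetical record mirroring the indexed properties of a field."""
    field: str                          # (2) technical name, case-sensitive
    mandatory: bool = False             # (1)
    kind: str = "annotation"            # (3) "annotation" or "tag"
    data_type: str = "string"           # (4)
    scope: str = "section"              # (5) page, section or document
    visible: bool = True                # (6)
    multiple: bool = False              # (7)
    count_in_evaluation: bool = True    # (8)
    conditional: Optional[str] = None   # (9)
    display_name: Optional[str] = None  # (10) falls back to Field (2)

    def __post_init__(self) -> None:
        # The display name defaults to the technical field name.
        if self.display_name is None:
            self.display_name = self.field


def missing_mandatory(fields: List[FieldDefinition],
                      values: Dict[str, str]) -> List[str]:
    """Return the technical names of mandatory fields that have no value."""
    return [f.field for f in fields if f.mandatory and not values.get(f.field)]


# Example field names are made up for illustration.
fields = [
    FieldDefinition("invoice_date", mandatory=True),
    FieldDefinition("comment", count_in_evaluation=False),
]
print(missing_mandatory(fields, {"comment": "paid late"}))  # ['invoice_date']
```

A non-empty result corresponds to the situation in which submission would be blocked.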
Other tables
You may want to maintain the row logic for tables in the extraction of information. To this end, we allow you to specify tables, which contain one or more row types.
For example: you may have a table of line items with four columns (description, unit price, quantity, line total). You would then have two types of rows for this table: line items and a total line. The total line would be of a different type as it will not have a unit price. It will have a fixed description ("total"), a quantity and an overall total.
A row is essentially an ordered list of text and tag fields. In addition, each row has a row_type, which defines to which table it belongs.
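The line item example can be sketched as a simple data structure. This is an assumption-laden illustration, not the product's internal model: the classes `Row` and `Table`, the method `rows_of_type` and the sample values are hypothetical. Only `row_type` and the four column names come from the text above.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Row:
    row_type: str  # e.g. "line_item" or "total"
    # Field name -> value; dicts keep insertion order, so the fields stay ordered.
    values: Dict[str, str] = field(default_factory=dict)


@dataclass
class Table:
    name: str
    rows: List[Row] = field(default_factory=list)

    def rows_of_type(self, row_type: str) -> List[Row]:
        """Return all rows of the given type, in their original order."""
        return [r for r in self.rows if r.row_type == row_type]


line_items = Table("line_items", [
    Row("line_item", {"description": "Widget", "unit_price": "2.50",
                      "quantity": "4", "line_total": "10.00"}),
    Row("line_item", {"description": "Gadget", "unit_price": "3.00",
                      "quantity": "1", "line_total": "3.00"}),
    # The total line has a fixed description and no unit price.
    Row("total", {"description": "total", "quantity": "5",
                  "line_total": "13.00"}),
])

print(len(line_items.rows_of_type("line_item")))                   # 2
print("unit_price" in line_items.rows_of_type("total")[0].values)  # False
```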
Edit Formats
Navigate to "Studio" > "Formats" to get an overview of the available formats. Click the edit button next to a format to view it. You'll see the Fields table and the other tables for that format.
Edit Formats through the Studio Controls in the Data Entry Companion
A format defines the fields you are interested in for a given document type. It is reflected in the data table that you can see in the Data Entry Companion. You can also edit the format directly from within the Data Entry Companion.
To do this, you need to enable the Studio Controls from within the Data Entry Companion. You will recognise the Studio Controls by the three dots icon.
Edit properties of fields
You can edit the properties of a field by clicking on the three dots in front of its row in the Fields table. For other tables, click on the three dots in the column headers to edit the properties.
Add fields
You can add Text or Tag fields in two ways:
By using the placeholder at the bottom of the Fields table
By dragging a box around an area of interest and indicating that you want to create a new field
Train
12 | Description | A short text to characterise the field of interest. |
Scope (5) and the importance of page/subpage splitting
It is important to distinguish the different types of scope: file, document, page and section.
A file is a container in which data is stored. This can be a .pdf, .jpg, .eml, .zip and so on. The illustration above shows one file.
A document is a representation of information that can be understood by a human. Here, we talk about invoices, receipts, ID cards, and so on. Each file contains at least one document but can contain more than one. In the illustration above, this one file contains 3 documents: a contract, an invoice, and an ID card document.
A page is one side of a document. In the review pane, you can view one page at a time. Page splitting enables the classification of different formats within one file. Each file contains at least one page, and the same goes for documents. The illustration below has 4 pages (1 page of a contract, 2 pages of invoices, and 1 page of an identity card).
A section is a part of a bigger item: a file can be split into different documents (sections), and a page can be split into different subpages (sub-sections). Subpage splitting enables the detection of different sections within one page. Sections are usually encountered when smaller receipts or ID cards are processed. For example, the front and back of an ID card are usually saved on one single page, or several receipts are grouped together on one single page. The illustration below shows that the fourth page is divided into 2 sections.
For fields in tables, only one scope can be chosen per table.
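The relationship between file, document, page and section can be sketched as nested records. The example mirrors the illustrated case (one file, 3 documents, 4 pages, with the fourth page split into 2 sections). The class names, the `sections` counter and the file name `upload.pdf` are hypothetical and chosen only for this sketch.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Page:
    number: int
    sections: int = 1  # number of subpages detected on this page


@dataclass
class Document:
    kind: str
    pages: List[Page] = field(default_factory=list)


@dataclass
class File:
    name: str
    documents: List[Document] = field(default_factory=list)

    def page_count(self) -> int:
        """Total number of pages across all documents in the file."""
        return sum(len(d.pages) for d in self.documents)


example = File("upload.pdf", [
    Document("contract", [Page(1)]),
    Document("invoice", [Page(2), Page(3)]),
    Document("id_card", [Page(4, sections=2)]),  # front and back on one page
])

print(len(example.documents))                 # 3
print(example.page_count())                   # 4
print(example.documents[-1].pages[0].sections)  # 2
```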
Conditional (9)
It is possible to add logical AND/OR conditions for specific fields. Choose any field and click on “show” in the conditional column. In the pop-up screen you can add rules that determine when the selected field is shown. You can also add groups to combine AND and OR logic. In the illustration below, the condition would work as follows: the field of interest would only be relevant if:
The gross amount is not null, AND
Either the amount payable or the net amount is not null.
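The grouped AND/OR logic of this example can be expressed as a small recursive check. This is a minimal sketch under assumptions: the nested dictionary layout, the function `is_relevant` and the field names `gross_amount`, `amount_payable` and `net_amount` are hypothetical and do not describe how the product stores or evaluates conditions.

```python
from typing import Any, Dict, Union

# A condition is either a field name that must be non-null,
# or a group of the form {"op": "AND" | "OR", "rules": [...]}.
Condition = Union[str, Dict[str, Any]]


def is_relevant(condition: Condition, values: Dict[str, Any]) -> bool:
    """Evaluate a (possibly nested) condition against extracted values."""
    if isinstance(condition, str):
        return values.get(condition) is not None
    results = [is_relevant(rule, values) for rule in condition["rules"]]
    return all(results) if condition["op"] == "AND" else any(results)


# Gross amount is not null AND (amount payable OR net amount is not null).
rule = {"op": "AND", "rules": [
    "gross_amount",
    {"op": "OR", "rules": ["amount_payable", "net_amount"]},
]}

print(is_relevant(rule, {"gross_amount": 121.0, "net_amount": 100.0}))  # True
print(is_relevant(rule, {"gross_amount": 121.0}))                       # False
print(is_relevant(rule, {"net_amount": 100.0}))                         # False
```

Nesting groups inside groups is what allows AND and OR to be mixed in a single condition.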
...
Predefined data types
“currency”, “language”, “country”: Fields created with these data types receive standardised predictions from our pre-filled predictors, such as “USD”, “EUR”, “GBP”, etc. for currency. This also means that a currency sign ($) annotated in the Data Entry Companion for a currency field will be correctly formatted as “USD”.
date fields: These fields return a value in the format DD/MM/YYYY. So even if the date appears as DD/MM/YY in the Review page, the field will be reformatted as DD/MM/YYYY.
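The effect of these predefined data types can be illustrated with a short sketch. It is not the product's implementation: the mapping `CURRENCY_CODES` and the functions `normalise_currency` and `normalise_date` are hypothetical, and the supported symbols are an assumption. Note also that Python's `%y` directive decides the century for two-digit years (00 to 68 become 2000 to 2068), which may differ from the product's behaviour.

```python
from datetime import datetime

# Hypothetical symbol-to-code mapping; the actual supported list is not given here.
CURRENCY_CODES = {"$": "USD", "€": "EUR", "£": "GBP"}


def normalise_currency(raw: str) -> str:
    """Map a currency sign to its code; pass other input through in upper case."""
    raw = raw.strip()
    return CURRENCY_CODES.get(raw, raw.upper())


def normalise_date(raw: str) -> str:
    """Reformat DD/MM/YYYY or DD/MM/YY input as DD/MM/YYYY."""
    # Try the four-digit year first so "2021" is not read as "20" plus leftovers.
    for pattern in ("%d/%m/%Y", "%d/%m/%y"):
        try:
            return datetime.strptime(raw.strip(), pattern).strftime("%d/%m/%Y")
        except ValueError:
            continue
    raise ValueError(f"Unrecognised date: {raw!r}")


print(normalise_currency("$"))       # USD
print(normalise_date("31/12/21"))    # 31/12/2021
print(normalise_date("31/12/2021"))  # 31/12/2021
```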
...
Tables
You may also want to create a section for tables.
This would be useful when you process documents with line items or receipt lines. For example: On your invoice you may have a table of all the articles you have bought (each article has its own article number, description, unit price, and quantity). This information is different from the header fields that would generally appear only once in a document, such as one invoice date per invoice.
...
As shown in the illustration above, there are two distinct places for header fields and line items: the header fields come in the first part of your format table, right below “Metadata”, whereas the line item fields come in the second part, below the word “Table”.
There will only be one section “Metadata”, where you can enter all the header fields you require, but there can be one or more tables, where you can enter all the line item fields you require.
To add a table, simply click on the last row, which says “Table name”, and enter the name of the table.
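The split between one “Metadata” section and one or more tables can be sketched as follows. This is a hypothetical model for illustration only: the class `Format`, the method `add_table` and all field and table names are assumptions, not the product's API.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Format:
    name: str
    # Exactly one metadata section holding the header fields.
    metadata: List[str] = field(default_factory=list)
    # Any number of tables: table name -> line item field names.
    tables: Dict[str, List[str]] = field(default_factory=dict)

    def add_table(self, table_name: str, field_names: List[str]) -> None:
        """Add a new table; table names must be unique within the format."""
        if table_name in self.tables:
            raise ValueError(f"Table {table_name!r} already exists")
        self.tables[table_name] = list(field_names)


invoice = Format("invoice", metadata=["invoice_date", "invoice_number"])
invoice.add_table(
    "line_items",
    ["article_number", "description", "unit_price", "quantity"],
)

print(invoice.metadata)      # ['invoice_date', 'invoice_number']
print(list(invoice.tables))  # ['line_items']
```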