This document also defines the format for structured text output resources (OTrx) and their supporting data, which are created by the translation engine, then used to display the translation history graph.
For additional information about these resources or their related data, see the BibleTrans Design Decisions.
Several of the formats share a standardized way of encoding a list of text items, consisting of a single number at the front defining how many (or which) items are there, followed by up to 31 integer links to the text items, followed by the text items themselves. The low 12 bits of each link are an offset to the (integer, 4-byte-increment) beginning of the text; the next 6 bits are the number of characters in the item, and the upper bits usually encode the pixel width of that item in its normal font. The Resource Viewer in DocPrep should know about most of these, and display them properly.
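The link layout just described can be sketched as follows. This is only an illustration, not code from BibleTrans: the function name and the returned field names are hypothetical, and it assumes a 32-bit link word whose "upper bits" are everything above bit 17.

```python
def decode_text_link(link):
    """Split one text-list link word into its three packed fields."""
    word_offset = link & 0xFFF           # low 12 bits: offset in 4-byte words
    char_count = (link >> 12) & 0x3F     # next 6 bits: number of characters
    pixel_width = (link >> 18) & 0x3FFF  # remaining upper bits (usually the width)
    return {
        "byte_offset": word_offset * 4,  # the offset counts 4-byte increments
        "char_count": char_count,
        "pixel_width": pixel_width,
    }

# Hypothetical link: text starts at word 3, is 5 characters long, 40 pixels wide.
example = (40 << 18) | (5 << 12) | 3
print(decode_text_link(example))
```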
Often a text list aggregates multiple single-line text items such as checkboxes or pushbuttons. The aggregate can be displayed separately as an editable list, permitting the user to change the names of the checkboxes. The resource ID of the list is a multiple of 32, and the IDs of the collected one-line items are each offsets from that list ID, by the line number. Thus 64 is the list of language categories, and 65 is the first checkbox in that list. These list/checkbox combinations can also be used to enable or disable other data formats, by the matching low bits of their respective IDs. Thus checkbox 65 enables radio button group ID 1089 and variable list ID 2113, as well as L&N linkage table ID 32065; it also supplies a name to be added to the title of the Language Category 1 window where the category variables are displayed.
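Because the list ID is a multiple of 32 and each item ID adds its line number, the two parts can be recovered with a mask. The helper below is a hypothetical sketch of that arithmetic only; it assumes non-negative IDs and does not model how the other formats match on the low bits.

```python
def split_item_id(resource_id):
    """Return (list ID, line number) for a one-line item in a text list."""
    list_id = resource_id & ~31     # clear the low five bits: multiple of 32
    line_number = resource_id & 31  # low five bits: line within the list
    return list_id, line_number

# Checkbox 65 is line 1 of list 64 (the language categories).
print(split_item_id(65))
```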
The numbers in the list here following index the various display alternatives
for purposes of displaying and editing the image elements. They are stored
in sequences of Adat resources separately numbered (shown
in parentheses below).
The second word of the format has 1s in the bits where elements in the corresponding item list represent variable names, and 0s where the elements are computed values. Only variables in the item list can be linked to slot labels in the Dot Connector format.
Each line of the group can have up to 31 elements, one per byte with
a byte count at the front of the line, for a total of exactly eight integer
words (32 bytes) for each line, starting at byte 8 (third integer word).
Only the low five bits of each byte are significant; the other bits may
be set to facilitate exporting, but are ignored.
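A minimal sketch of reading one such line, assuming the resource is available as a Python bytes object and that lines are numbered from zero (both assumptions; the function name is hypothetical). The count byte is masked to five bits as well, on the reading that "each byte" includes it.

```python
LINE_BYTES = 32        # eight integer words per line
FIRST_LINE_OFFSET = 8  # lines start at byte 8 (third integer word)

def read_group_line(resource, line_index):
    """Return the elements of one line, keeping only the low five bits."""
    start = FIRST_LINE_OFFSET + line_index * LINE_BYTES
    line = resource[start:start + LINE_BYTES]
    count = line[0] & 0x1F  # byte count at the front of the line
    return [b & 0x1F for b in line[1:1 + count]]

# Hypothetical resource: 8 header bytes, then one line holding three elements
# whose upper bits are set (for exporting) and must be ignored.
resource = bytes(8) + bytes([3, 0x41, 0x22, 0x7F]) + bytes(28)
print(read_group_line(resource, 0))
```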
The second word of the format is a row/column count, the number of actual rows and columns to be displayed; the row count is in the high half, and the column count is in the low half. Following this count word are tab stops for each column boundary (one more than the number of data columns), the pixel position of the left edge of that column. These tab stops are calculated dynamically from the actual widths of the labels and data items.
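The count word and tab stops might be read like this. The sketch assumes the resource has already been loaded as a list of unsigned 32-bit words and that the "halves" are 16 bits each; `read_table_layout` is a hypothetical name.

```python
def read_table_layout(words):
    """Return (rows, columns, tab stops, index of the first data word)."""
    count_word = words[1]              # second word of the format
    rows = (count_word >> 16) & 0xFFFF # row count in the high half
    columns = count_word & 0xFFFF      # column count in the low half
    stops_end = 2 + columns + 1        # one more tab stop than data columns
    tab_stops = words[2:stops_end]
    return rows, columns, tab_stops, stops_end

# Hypothetical table: 2 rows, 3 columns, four column boundaries.
words = [0, (2 << 16) | 3, 10, 50, 90, 130, 0, 0, 0, 0, 0, 0]
print(read_table_layout(words))
```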
Following the tab stops is one word for each data item, (rows * columns)
words. Each word is encoded as one of these, distinguished by its high two
(or, for the last two cases, three) bits:
00 | No data, or integer > 0
10 | Character
01 | 4 Chars
110 | Text link
111 | Negative integer
An integer 0 value is encoded as a character '0'. The low 18 bits of
a text link are understood in the same way as the links in a text list:
12 bits are an integer offset (at the end of the table), and six bits are
the text length. Four-character items allow short text entries (starting
with a letter) without allocating and managing variable-length text space.
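The tag bits can be tested in a few lines. This sketch only classifies a word and, for text links, extracts the offset and length; how the payload of the other kinds is packed is not stated here, so it is left alone. The function names are hypothetical, and words are assumed to be 32 bits.

```python
def classify_cell(word):
    """Name the kind of table cell from the high bits of its 32-bit word."""
    word &= 0xFFFFFFFF  # treat negative Python ints as 32-bit patterns
    top2 = word >> 30
    if top2 == 0b00:
        return "no data or positive integer"
    if top2 == 0b10:
        return "character"
    if top2 == 0b01:
        return "4 chars"
    # top two bits are 11: the third bit separates links from negatives
    return "text link" if (word >> 29) == 0b110 else "negative integer"

def text_link_fields(word):
    """Return (offset, length) from the low 18 bits of a text-link word."""
    return word & 0xFFF, (word >> 12) & 0x3F

print(classify_cell(-5))
print(text_link_fields(0xC0000000 | (7 << 12) | 20))
```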
Following the tabs, there are two words for each row: the first word
is encoded with the L&N concept number, and the second word is bitwise
encoded with the checkmarks for that row, the least significant bit corresponding
to column 1 and language category 1.
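Reading the checkmark word amounts to testing one bit per column. A hypothetical helper, assuming columns are numbered from 1 as in the text:

```python
def checked_columns(check_word, column_count):
    """List the 1-based columns whose checkmark bit is set."""
    return [col for col in range(1, column_count + 1)
            if (check_word >> (col - 1)) & 1]  # bit 0 is column 1

# Hypothetical row with columns 1 and 3 checked.
print(checked_columns(0b101, 4))
```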
The high half of the first word contains the 11-bit ID number of the node shape; the top 3 bits select the icon type, and the low 8 bits enumerate the various shapes defined for that type. The low four bits of the first word are the number of connection patterns in this group, and the remaining bits are used to temporarily hold the active group and anchor dot while the user is forming a connection. The high half of the second word contains the (relative local) xy coordinate of the free end of the connection line being formed, and the low half is the ID of a variable that selects which pattern to use when there is more than one.
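As an illustration of the shape ID split, the sketch below works on the 11-bit ID once it has been extracted; where exactly those 11 bits sit within the 16-bit high half is not stated above, so that step is deliberately left out. Both function names are hypothetical.

```python
def split_shape_id(shape_id):
    """Split an 11-bit node shape ID into (icon type, shape number)."""
    icon_type = (shape_id >> 8) & 0x7  # top 3 bits of the 11-bit ID
    shape_number = shape_id & 0xFF     # low 8 bits
    return icon_type, shape_number

def pattern_count(first_word):
    """Number of connection patterns: low four bits of the first word."""
    return first_word & 0xF

print(split_shape_id(0x2A5))
```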
Each defined pattern takes six integer words: the high half of the first
word is the drag line ID, and the low four bits
select one of its lines, if more than one. The remaining bits of the first
two words contain formatting information. The last four words in each pattern
enumerate, in Item List order, which slot is connected
to each item, four bits each, for a total of 31 possible items. Zero in
any item position means no connection. There are at most eight slots in any
node shape, the index of which fits easily in four bits.
The values are stored in prefix Polish form, one integer per code; the operation is in the high byte, and the xy location of its popup in the displayed image is packed into the low 24 bits. The five data codes are followed by one or more words of actual data. Expression values can be arbitrarily complex, up to the size limit of resources; nested values are shown in depressed rectangles, with the operators in popup menu buttons.
0 Null data
1 Integer data
2 Up to 4 chars of text data
3 Variable data
4 Formatting code data
5 String Length() function
6 Negative
7 Logical NOT
8 +
9 -
10 *
11 /
12 MOD
13 AND
14 OR
15 XOR
16 String Item() function
17 String ConCat() function
18 <
19 >=
20 <=
21 >
22 =
23 unequal
24 L&N number of tree node
25 parent of tree node
26 sibling of tree node
27 child of tree node
28 noun# of tree node
29 Bible reference (bk,ch,vs)
30 pronoun # of tree node
31
32 ...
62 else
63 if ... then
128+n n-character string data
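Splitting one code word into its operation and popup location could look like the following. It is a sketch under the assumption of 32-bit words; the helper names are hypothetical, only the five data codes are named, and the packing of x and y inside the low 24 bits is not described here, so the location is returned unsplit.

```python
DATA_CODES = {
    0: "Null data",
    1: "Integer data",
    2: "Up to 4 chars of text data",
    3: "Variable data",
    4: "Formatting code data",
}

def split_code(word):
    """Return (operation, packed popup xy) from one prefix-form code word."""
    operation = (word >> 24) & 0xFF  # high byte
    popup_xy = word & 0xFFFFFF       # low 24 bits, left packed
    return operation, popup_xy

def describe(operation):
    """Give a readable name for the data codes and for string codes."""
    if operation >= 128:
        return f"{operation - 128}-character string data"
    return DATA_CODES.get(operation, f"operation {operation}")

op, xy = split_code((131 << 24) | 0x001234)
print(describe(op), hex(xy))
```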
The high half of the first word is the number of variables being set,
and the low half is the Item List ID containing
popups for possible values. Each variable line consists of a word containing
the variable list ID and line number (in its high
half) of the variable being set, followed by a word for the value to be
assigned, encoded similarly to table values.
The low 22 bits of the first word choose a variable to connect. The
same bits in the second word identify a selector variable, if there is
more than one line; the number of lines is in the upper byte of this second
word. Each additional word is one line of connection, with the drag line ID
in the low half and its line number in the low four bits of the high half.
0 <GfDg ..> This is a placeholder for a 12x16 glyph table starting in resource#10000 (it could grow as big as seven or eight 1K resources, depending on how many characters are defined and how wide they are). The table is in the form of a font table as used for normal text display: Table position [0] is the 4-character font name, [1] gives the total height and ascent (from base line to top of cell) in 16-bit numbers, followed by [2] the character spacing and [3] space width. Beginning in [4] is an index of offsets to the beginning of that ASCII character (-28, that is, [4] is space character 32, [5] is '!', and so on) to character 255 in [223], and the offset to the end of the table in [224]. Glyph width is determined by the difference between adjacent glyph offsets. See example below.
1 <GfDg ..> This is a placeholder for a panel displaying the pixels of a single glyph enlarged for editing. The resource contains the index of the selected glyph, and some positioning information; the actual pixels are in the table. Only the first word [0] is used: the low byte is the selected glyph, or zero if none; the next four bits are the cell width (editable white background), then a 4-bit offset from the left of the panel, the width of the grey zone there. The upper half of this word is a 5-bit offset that allows for the black pixels to start other than at the left edge of the character cell (as in the font table). The pixels in the file are always normalized, but you don't want them jumping around in the editing panel as you add or delete pixels on the left.
2 <GfDp ..> This is a kerning table, in case the language needs to overlap vowel and consonant glyphs (not yet implemented).
3 <GfDp ..> This is a text string in the glyph font, so the characters can be viewed in context.
4 <GfDD ..> These are 27 character sets (named by each letter of the Roman alphabet, plus a special set of word breaks named "#"). Each set is a 224-bit bitmap, one bit for each character in the set.
5-7 <GfDp ..> These are three groups of morphological (character substitution) rules to be applied after translation. Because the translation rules are ASCII only, one of these (#6) defines a conversion from (Roman) ASCII to whatever character font is defined in the glyphs. The other two perform substitutions before (#5) and after (#7) conversion. The differences are superficial (which font the characters are displayed in); all rules work exactly the same, and are tested in strict numerical sequence exactly once on each generated text character. Each rule is stored as a sequence of characters that is the "context" for applying the rule, followed by a sequence of characters to replace the match with. Character set codes (1-27) can be used in the rules to refer to any character in the corresponding lettered set (#4 above).
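The glyph width rule from the font table in item 0 (width is the difference between adjacent offsets) can be sketched as below. The helper name is hypothetical, the table is assumed to be a list of integers, and the index position is taken as the character code minus 28, following the statement that [4] is the space character.

```python
INDEX_BIAS = 28  # table position = character code - 28, so [4] is space (32)

def glyph_extent(table, char_code):
    """Return (start offset, width) of a glyph from the offset index."""
    position = char_code - INDEX_BIAS
    start = table[position]
    width = table[position + 1] - start  # difference of adjacent offsets
    return start, width

# Hypothetical table: four header words (zeroed here), then offsets for
# space, '!' and '"', followed by the offset that ends the last glyph.
table = [0, 0, 0, 0, 0, 4, 7, 12]
print(glyph_extent(table, ord("!")))
```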
OTrx #32767 is an index of the available translated text resources in this file, in reverse order of translation (most recent first). The beginning of the resource is the number of entries, then the date (in seconds since 2000 Jan 1) of most recent translation. Beginning with offset +2, each entry consists of two numbers, first the episode number from the Tree episode that was translated, then the specific Bible reference (if any, or else the reference associated with that episode), which is used to represent it in the Translation menu. Each such episode is stored in one or more OTrx resources, numbered first by the episode number, then sequentially by adding 1024 (0x400) to the episode number, for a maximum of 32 segments, which should be sufficient for a thousand words.
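The index layout and the segment numbering can be illustrated as follows. This is a sketch, not the program's own code: the function names are hypothetical, the index is assumed to be a list of integers, and the date is converted without any time zone adjustment, since none is specified.

```python
from datetime import datetime, timedelta

EPOCH = datetime(2000, 1, 1)  # dates count seconds since 2000 Jan 1

def parse_otrx_index(words):
    """Return (date of latest translation, [(episode, Bible reference), ...])."""
    count = words[0]
    latest = EPOCH + timedelta(seconds=words[1])
    entries = [(words[2 + 2 * i], words[3 + 2 * i]) for i in range(count)]
    return latest, entries

def segment_ids(episode):
    """Resource numbers of the (up to 32) OTrx segments of one episode."""
    return [episode + 1024 * k for k in range(32)]

# Hypothetical index with two entries, translated one day after the epoch.
latest, entries = parse_otrx_index([2, 86400, 7, 100, 3, 200])
print(latest, entries, segment_ids(7)[:3])
```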
When it is formatted, the effective image can be up to a half-million pixels wide, which is sufficient for 30,000 characters 16 pixels each (including spaces), or about 5000 medium words, eight single-space typewritten pages. Properly encoded, no Bible episode should generate more than a quarter of that.
The first eight numbers in the first OTrx resource are a header:
+0 The start of the node data (=8) in high, the next available item, just past end of the data, in low
+1 The episode number
+2 The rectangle, in 16-pixel scroll units (may be 0 in file)
+3 The size of display data (may be 0 in file)
+4 The Bible reference
+5 The date&time this resource was created
+6 Pixel offset (to the left of this resource)
+7 (reserved for future use)
Subsequent OTrx resources in the same episode have only a 4-word header:
+0 The start of the node data (=4) in high, the next available item, just past end of the data, in low
+1 The episode number
+2 Pixel offset (to the left of this resource)
+3 The size of display data (may be 0 in file)
After the header, each node in the graph consists of two numbers. The column top is distinctive, in that it defines the horizontal position of the translated word (which may be 0 in the file, it is calculated on the fly and saved to the file only if there is more than one resource), and a link to the next column top (the next translated word in the resource). If there is a gloss, it looks like a column top but the horizontal position part is zero (the gloss is centered under the previous column top) and there is no additional history data following it.
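A small sketch of reading either OTrx header form, assuming the resource is a list of 32-bit words with 16-bit halves. The function and the dictionary keys are hypothetical names chosen for this example.

```python
FIRST_FIELDS = ["episode", "rectangle", "display_size", "bible_reference",
                "created", "pixel_offset", "reserved"]
LATER_FIELDS = ["episode", "pixel_offset", "display_size"]

def parse_otrx_header(words, first=True):
    """Read the 8-word (first resource) or 4-word (later resource) header."""
    names = FIRST_FIELDS if first else LATER_FIELDS
    header = {
        "node_start": (words[0] >> 16) & 0xFFFF,  # high half of word +0
        "next_free": words[0] & 0xFFFF,           # low half of word +0
    }
    header.update(zip(names, words[1:1 + len(names)]))
    return header

# Hypothetical headers for episode 7.
first = parse_otrx_header([(8 << 16) | 20, 7, 0, 0, 123, 456, 0, 0])
later = parse_otrx_header([(4 << 16) | 10, 7, 512, 0], first=False)
print(first["node_start"], later["pixel_offset"])
```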
The second word of the column top or gloss item is an index into the EmTx (output text) or GlTx (gloss) resource with the same episode number. Duplicate words in the output text point to the same word in the word list. The index points in turn into a TxtS resource of the same episode sequence, where the text string of that word may be found. The index word is packed from three numbers, the byte offset into the selected TxtS resource, six upper bits of the resource number (added to the episode number) which count down from 31 (31744 + episode number), and the length of that word. Small colored rectangles represent non-text output, the first three "words" in the list.
Following the column top are rule references, two numbers each. Each rule reference consists of four 16-bit numbers packed into two integers: the line number in that rule, and the rule reference number, then some flags. The flag word is positive for lexical rules (seven bits of domain over nine bits of concept within that domain), and negative for a named rule index into a CodX resource (see "Compiled Rule Code"); its low bits are zero when there is more in this column, or non-zero when the next number links to another column for a horizontal line or is zero at the root of the tree.
EmTx (output text) and GlTx (gloss) resources are sequences of integers.
Each entry is packed with a byte offset (low 12 bits), then a part index (6
bits, tacked onto the 10-bit episode, which gives the resource number where
this word can be found), a 6-bit length, and 8 bits of display (pixel)
width if it fits. These in turn point into TxtS resources
containing the actual text, which is limited to 61440 bytes of output text
in a single episode. (The entries could be unique words or word fragments,
but the text is now straight output text plus separators, including larger
gaps at resource boundaries, so that no word crosses a boundary.) This is
more than enough to accommodate several thousand words, far larger than the
biggest well-formed BibleTrans episode.
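Unpacking one EmTx or GlTx entry might look like this. The bit positions are an assumption: the fields are taken in the order listed, from the low bits upward (12 + 6 + 6 + 8 = 32 bits), and the TxtS resource number is formed by placing the part index above the 10-bit episode number. The function name is hypothetical.

```python
def unpack_text_entry(entry, episode):
    """Split an EmTx/GlTx entry; bit order above the offset is assumed."""
    byte_offset = entry & 0xFFF        # low 12 bits
    part_index = (entry >> 12) & 0x3F  # next 6 bits (assumed position)
    length = (entry >> 18) & 0x3F      # next 6 bits (assumed position)
    pixel_width = (entry >> 24) & 0xFF # top 8 bits (assumed position)
    txts_resource = (part_index << 10) + episode  # part above 10-bit episode
    return byte_offset, length, pixel_width, txts_resource

# Hypothetical entry: offset 100 in part 31, 5 characters, 20 pixels wide.
entry = (20 << 24) | (5 << 18) | (31 << 12) | 100
print(unpack_text_entry(entry, episode=7))
```

With part index 31 and episode 7 the resource number comes out as 31744 + 7, which matches the count-down-from-31 numbering described for TxtS resources.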
Rev. 2013 September 27