Structured Text Output Window


When BibleTrans works properly, it works very well. Until then, it is very complex. One of those complexities is the Structured Text Output window, which gives the user some insight into where the (initially incorrect) translated text came from. Each word and word fragment of the translation is generated by a rule, typically a Syntax line. Knowing which rule, and the chain of rules that invoked it, can sometimes be crucial in determining why the translation is not what was intended. Usually the computer will do exactly what you tell it to do, even if you didn't want to tell it to do that. Debugging (fixing errors in) your grammar consists mostly of finding where invalid assumptions produced unexpected results.

This document describes (mostly for programmers like myself who need to maintain the code) how the structured text window data is constructed and formatted for display.
 

Resources

There are three kinds of resources in the language file, all numbered by the episode being translated:

Imag This is defined and described mostly in the picture specification part of the Design Document. It holds the drawing commands to display the structured output.

EmTx This contains the translated text and a word index, as described in the Design Document. The OTrx resource uses short-number references to link to the EmTx index, which then links to the text one word (or word fragment) at a time.

OTrx There is a brief discussion of the OTrx format in Data Formats. This present document is concerned more with why and how, rather than where the bits are. The first OTrx resource is numbered by the episode; subsequent resources, as needed, are given numbers +1024 (there are fewer than a thousand episodes, so no collision is possible).
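The numbering rule can be sketched in a few lines. This is only an illustration in Python, not code from the BibleTrans source: the function name `otrx_resource_id`, the constant names, and the reading of "+1024" as "1024 more for each additional resource" are all assumptions.

```python
OTRX_STRIDE = 1024   # spacing between successive OTrx resources (from the text)
MAX_EPISODES = 1000  # "fewer than a thousand episodes" (from the text)

def otrx_resource_id(episode, block):
    """Resource number of OTrx block `block` for an episode (block 0 is the first).

    Assumption: each additional block adds another 1024 to the number.
    """
    if not 0 <= episode < MAX_EPISODES:
        raise ValueError("episode number out of range")
    if block < 0:
        raise ValueError("block index must be non-negative")
    return episode + block * OTRX_STRIDE
```

Because every episode number is below 1024, the remainder of a resource number modulo 1024 always recovers the episode, which is why two episodes can never claim the same number under this reading.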
 

Example

I opened the "John+Mary-Base.txt" Tree file, then translated it using the standard (default) English grammar. This is the structured output window:

The tiny hex numbers (except brown, not shown here) represent the OTrx resource positions of that information. They are enabled in the debugging mode, which you can turn on by typing "Tlog" (no quotes) into the entry box that opens from menu Debug->Open Doc Pg#. When the image is sufficiently wide, tiny brown numbers along the top show pixel positions in increments of 256. Here are the respective resources, as dumped by DocPrep's ResViewer, which I colored to match the colored numbers (and squares) to which they correspond in the picture:
 

[003F] OTrx #958:
000:  0008004A 000003BE 0014001D 01B00100
004:  00430101 197D102C 00000000 00000000
008:  001E0012 00020001 100C0000 00052409
00C:  801E8102 00020004 01148119 00020008
010:  012B0054 00000000 005C0024 12010004
014:  100C0000 01002441 802A0000 01FF25CB
018:  80420000 0007BABE 005A0000 14010003
01C:  007280CC 0100241A 80DE0000 01FF2624
020:  80ED0000 0007322B 00FC0102 001E000D
024:  0096002A 1C010005 100C0000 1601242C
028:  801E00CC 0000001D 00CF003C 12010006
02C:  100C0000 01002441 802A0000 01FF25CB
030:  80420000 0007BAFD 005A0000 06010003
034:  00720000 01012439 808A0000 0001258E
038:  80A20000 1801005C 00BA00CC 005C001D
03C:  010B0042 00010003 100C8014 00022409
040:  80260119 0000000F 01320046 00010007
044:  100C0014 010B003F 0144004A 00020008
048:  100C0119 0059000F 00000000 00000000

[0062] EmTx #958:
000:  0000001C 0001F001 0005F002 0005F001
004:  1A11F004 1C15F009 1D11F00F 0305F015
008:  0C11F017 00000000 00000000 00000000

[0063] EmTx #32702:
000:  0D0A090A 4A6F686E 206C6F76 6573204D      John loves M
010:  61727920 082E2020 20202020 00000000  ary  .

Here's how to interpret the data, as best I can figure it out this long after writing it. The green tiny numbers at the top of the image represent the column tops in the data (also green in the OTrx). The first column always starts at offset [8], which holds the horizontal pixel position of the column center (001E) and the link to the next column top (0012). It is followed by the line number in the current rule (0002 in "DoSentence") and the first word in the translation (0001), which is a capitalization code because it is a sentence start (the bright red square, 09 in the text resource), then the vertical position of that text (00C) over a halfword that is zero here, but otherwise contains the horizontal position of a join.

This is followed by the line number (0005) of the invoking rule with the ID of the "DoSentence" rule (2409 in the next word, actually an index into CodX resource #9, offset +009, which links to that rule's code in Code #1 +06C), then something related [I don't know exactly how] to the coordinates of the line below the rule name, with the high bit set because it's a named rule. These two words are repeated for lexical rule 0.4 then 0.8, where each blue number represents a join formed when a later column encounters a rule already in the resource: the blue number is the first of the two items for the rule below the horizontal line, and the low half of the previous word has bit 15 set, or else is a link to that actual item (and the previous halfword has bit 15 = 0, see 029).

The next word (at [18] = 012 in green again) starts the next column with emit text "John" (word #4 in the EmTx, which links to byte 4 of the text EmTx for 4 bytes) from line 1.9 of rule "NounPhr L", and so on.

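To follow the walk-through against the dump, it helps to split each 32-bit word into its two halfwords. The sketch below does only that, using words copied from OTrx #958 above. The helper names `halfwords` and `has_bit15` are hypothetical, and the field meanings in the comments repeat the description above rather than any confirmed format specification.

```python
def halfwords(word):
    """Split a 32-bit word into its (high, low) 16-bit halves."""
    return (word >> 16) & 0xFFFF, word & 0xFFFF

def has_bit15(half):
    """True when the top bit of a 16-bit halfword is set (used as a flag)."""
    return bool(half & 0x8000)

# Word [8]: column center x (high half) and link to the next column top (low half).
column_x, next_column = halfwords(0x001E0012)

# Word [9]: line number in the current rule and the EmTx word reference.
rule_line, emtx_word = halfwords(0x00020001)

# Word [0B]: line number in the invoking rule and the rule ID.
caller_line, rule_id = halfwords(0x00052409)

# Word [0C]: the high half carries the "named rule" flag in bit 15.
flagged_half, _ = halfwords(0x801E8102)
named_rule = has_bit15(flagged_half)
```

Here `next_column` comes out as hex 012, which is the green offset where the "John" column starts.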
[0042] Imag #958:
000:  014001D0 20005020 10003712 30303800   @    P   7 008
004:  10000105 2000701B 00006006 10000100        `   `
008:  2000F01E 3001501E 2001F000 1000B100      0 P
00C:  446F2053 656E7465 6E636500 2002101E  Do Sentence
010:  3010301E 20101020 100037B4 30304400  0 0       7 00D
014:  2000505E 10003712 30313200 2000D04F    P^  7 012    O
018:  10005100 4A6F686E 20000000 2000F05C    Q John       \
01C:  3002105C 2002B03F 10009100 4E6F756E  0  \   ?    Noun
020:  50687220 4C000000 2002D05C 3003905C  Phr L      \0  \
024:  20043048 10007100 2A546869 6E672000    0H  q *Thing
028:  2004505C 3005105C 2005B04C 10006100    P\0  \   L  a
02C:  39332E31 39300000 2005D05C 3006905C  93.190     \0  \
030:  20073055 10003100 302E3300 2007505C    0U  1 0.3   P\
034:  300CD05C 200CB05E 100037B4 30314400  0  \   ^  7 01D
038:  20005098 10003712 30323400 2000D088    P   7 024
03C:  10005100 6C6F7665 73000000 2000F096    Q loves
040:  30015096 2001F07B 10009100 56657262  0 P    {    Verb
044:  50687220 4C000000 20021096 300CD096  Phr L       0
048:  200CB098 100037B4 30323900 200050D1        7 029   P
04C:  10003712 30324100 2000D0C1 10005100    7 02A       Q
050:  4D617279 20000000 2000F0CF 300210CF  Mary        0
054:  2002B0B2 10009100 4E6F756E 50687220          NounPhr
058:  4C000000 2002D0CF 300390CF 200430BB  L       0     0
05C:  10007100 2A546869 6E672000 200450CF    q *Thing    P
060:  300510CF 2005B0BF 10006100 39332E32  0         a 93.2
064:  35330000 2005D0CF 300690CF 200730C8  53      0     0
068:  10003100 302E3300 200750CF 300810CF    1 0.3   P 0
06C:  2008B0B3 10009100 4164706F 734C6E20          AdposLn
070:  4C000000 2008D0CF 300990CF 200A30B3  L       0     0
074:  1000A100 2A53656D 2E526F6C 65200000      *Sem.Role
078:  200A50CF 300B10CF 200BB0C5 10005100    P 0         Q
07C:  302E3932 20000000 200BD0CF 300CD0CF  0.92        0
080:  300CD05C 200CC0D1 10003700 30334200  0  \      7 03B
084:  200CD095 300D5095 200DF07E 10009100      0 P    ~
088:  50726F70 6E73204C 20000000 200E1095  Propns L
08C:  300E4095 200EE076 1000D100 2A747261  0 @    v    *tra
090:  6E736974 69766520 20000000 200F0095  nsitive
094:  300F3095 200FD088 10005100 32352E34  0 0       Q 25.4
098:  33000000 200FF095 30103095 3010301E  3       0 0 0 0
09C:  20102097 10003700 30323300 20103059        7 023   0Y
0A0:  3010B059 20115052 10003100 302E3400  0  Y  PR  1 0.4
0A4:  20117059 3011A059 2011805B 100037B4    pY0  Y   [  7
0A8:  30304600 2000510D 10003712 30334300  00F   Q   7 03C
0AC:  1000011E 20007108 00006006 10000100        a   `
0B0:  2000F10B 3001510B 2001310D 100037B4      0 Q   1   7
0B4:  30334600 20005134 10003712 30343200  03F   Q4  7 042
0B8:  2000D131 10001100 2E000000 2000F132     1    .      2
0BC:  30015132 3001510B 20014134 10003700  0 Q20 Q   A4  7
0C0:  30343500 2001511E 3001D11E 20027100  045   Q 0     q
0C4:  1000B100 446F2053 656E7465 6E636500      Do Sentence
0C8:  2002911E 3011A11E 20118120 100037B4      0         7
0CC:  30343100 20005146 10003712 30343600  041   QF  7 046
0D0:  2000D13E 10005100 20202020 20000000     >  Q
0D4:  2000F144 3011A144 3011A059 20119146     D0  D0  Y   F
0D8:  10003700 30343900 2011A0CE 301220CE    7 049     0
0DC:  2012C0C7 10003100 302E3800 201310D0        1 0.8
0E0:  10003704 30313100 00000000 00000000    7 011

[000D] Hots #958:
000:  2B456D78 03BE0009 0006001B 0012003C  +Emx           <
004:  436F6479 80002409 0015FFFE 0021003F  Cody  $      ! ?
008:  2B456D78 03BE0013 0003004D 000F006B  +Emx       M   k
00C:  436F6479 81002441 0021003D 002D007C  Cody  $A ! = - |
010:  436F6479 81FF25CB 00390046 00450073  Cody  %  9 F E s
014:  4C264E52 0000BABE 0051004A 005D006F  L&NR     Q J ] o
018:  4C264E52 00000003 00690053 00750066  L&NR     i S u f
01C:  2B456D78 03BE0025 00030086 000F00A6  +Emx   %
020:  436F6479 9601242C 00150079 002100B4  Cody  $,   y !
024:  2B456D78 03BE002B 000300BF 000F00E0  +Emx   +
028:  436F6479 81002441 002100B0 002D00EF  Cody  $A !   -
02C:  436F6479 81FF25CB 003900B9 004500E6  Cody  %  9   E
030:  4C264E52 0000BAFD 005100BD 005D00E2  L&NR     Q   ]
034:  4C264E52 00000003 006900C6 007500D9  L&NR     i   u
038:  436F6479 81012439 008100B1 008D00EE  Cody  $9
03C:  436F6479 8001258E 009900B1 00A500ED  Cody  %
040:  4C264E52 0000005C 00B100C3 00BD00DC  L&NR   \
044:  436F6479 000D4095 00D5007C 00E100AF  Cody  @    |
048:  436F6479 81FF2624 00E40074 00F000B6  Cody  &$   t
04C:  4C264E52 0000322B 00F30086 00FF00A5  L&NR  2+
050:  4C264E52 00000004 010B0050 01170063  L&NR       P   c
054:  2B456D78 03BE003D 00060108 00120216  +Emx   =
058:  2B456D78 03BE0043 0003012F 000F0136  +Emx   C   /   6
05C:  436F6479 0001C11E 001D00FE 0029013F  Cody         ) ?
060:  2B456D78 03BE0047 0003013C 000F014C  +Emx   G   <   L
064:  4C264E52 00000008 012200C5 012E00D8  L&NR     "   .
068:  6E6F586F 00000000 00000000 014001D0  noXo         @
06C:  2B655478 00430101 00000000 00100233  +eTx C         3
070:  00000000 00000000 00000000 00000000

StackEmit

StackEmit is the function in ExecEng.t2 that builds the OTrx resource. It is given a text string (typically a single word or word fragment), which it adds to the generated output text and its index, then proceeds to insert it into the current OTrx block after making sure it has enough space for this word and all its back-links.

The first word of the resource contains an index (resource increment + integer offset) to the next available insertion point, the translated word at the top of the next column of link-back names. That item in the resource will be the link forward to the next column top, followed by the reference for this word in the EmTx index, then the trace-back information extracted from the run-time rule stack, and terminated by a link back to whatever previous rule link already exists in the OTrx, or zero if it's the root.

The run-time stack, which you can examine in the Debug window, or in more detail in the debug log file, has links back from each rule invocation to the rule (and its line number) which invoked it, and so on all the way back to lexical rule 0.8 Root. The stack maintains a reference to the rule name for display in the Debug window, and a link back to the previous rule, so that when a rule exits the stack can be cut back to show the previous rule on top, and the translation engine can resume stepping through that rule. The rule names and line numbers are extracted by StackEmit to insert into the OTrx resource, so that if you click on one of those names in the window, that rule will be opened and that line number shown.

Each time a stack node is captured to the OTrx resource, it is flagged with the OTrx resource position, so that the next time a StackEmit trace-back hits it, it will terminate the trace-back capture and draw a horizontal line in the window. This is indicated by a special link in the resource.
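The capture-and-flag idea can be modeled with a toy stack. This is a simplified sketch, not the StackEmit code: the `StackNode` class, the tuple records, and the line numbers in the usage are all invented for illustration, and the real resource packs this information into halfwords rather than Python tuples.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class StackNode:
    rule: str                       # rule name shown in the window
    line: int                       # line number recorded with this stack entry
    caller: Optional["StackNode"]   # link back to the invoking rule; None at the root
    otrx_pos: Optional[int] = None  # set once this node has been captured

def capture_column(word, top, otrx):
    """Append one column to `otrx`: the emitted word, then its trace-back."""
    otrx.append(("word", word))
    node = top
    while node is not None:
        if node.otrx_pos is not None:
            # An earlier column already captured this rule: stop and record a join.
            otrx.append(("join", node.otrx_pos))
            return
        node.otrx_pos = len(otrx)   # flag the node with its position
        otrx.append(("rule", node.rule, node.line))
        node = node.caller
    otrx.append(("root", 0))        # a zero link marks the root

# Two words emitted by different rules that share the same caller.
otrx = []
sentence = StackNode("DoSentence", 5, None)
capture_column("John", StackNode("NounPhr L", 2, sentence), otrx)
capture_column("loves", StackNode("VerbPhr L", 2, sentence), otrx)
```

The first column runs all the way to the root. The second stops as soon as it reaches "DoSentence", which is already flagged, and records a join to that position instead of repeating the rest of the chain.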

StackEmit only stores the output text and links in the resource; PreCalcDisplay is responsible for placing them at particular coordinates in the window, and storing those coordinates back into the same resource. It is possible that the window will be redrawn between output steps in the translation, so both functions monitor a sentinel variable, OTrxAsEmitted, to know whether their copy of the OTrx resource is out of date.

Gloss text is inserted into the OTrx resource in the same manner, but it is assumed to be associated with the previous translated text, so there is no trace-back information.
 

PreCalcDisplay

PreCalcDisplay is called when the window needs a refresh, to prepare the OTrx data for display. It is designed to work for a while, then pause while the computer does other things (like continuing the translation), then resume where it left off (unless additional output has made it obsolete, in which case it starts over). Word [7] of the first OTrx resource contains a non-zero link where the previous pass left off when it can be resumed.
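The pause, resume, and restart behavior can be sketched as a small state machine. This is a toy model under stated assumptions, not the PreCalcDisplay code: the class `LayoutPass`, its `budget` parameter, and the integer `version` (standing in for the OTrxAsEmitted sentinel) are hypothetical, and `resume_at` only plays the role that word [7] plays in the real resource.

```python
class LayoutPass:
    """Toy layout pass that pauses only at column tops and can resume later."""

    def __init__(self):
        self.resume_at = 0      # index of the next column to place
        self.seen_version = -1  # sentinel value observed when this pass began
        self.placed = []        # columns that have been assigned positions

    def run(self, columns, version, budget):
        """Place up to `budget` columns; return True once every column is placed."""
        if version != self.seen_version:
            # New output arrived since the last slice, so earlier work is obsolete.
            self.seen_version = version
            self.resume_at = 0
            self.placed = []
        end = min(self.resume_at + budget, len(columns))
        for i in range(self.resume_at, end):
            self.placed.append(columns[i])
        self.resume_at = end
        return end == len(columns)

layout = LayoutPass()
first = layout.run(["John", "loves", "Mary"], version=1, budget=2)   # pauses early
second = layout.run(["John", "loves", "Mary"], version=1, budget=2)  # resumes, finishes
third = layout.run(["John", "loves", "Mary", "."], version=2, budget=2)  # starts over
```

The third call sees a different sentinel value, discards what it had placed, and begins again from the first column.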

PreCalcDisplay starts with the position of the first emitted text word, in position [8], which is placed in the top-left corner of the window, and that location is recorded in the resource. Each subsequent trace-back rule name is assumed to be in the same column, but the vertical positions are assigned with sufficient space to prevent collision with names in adjacent columns on alternate grid positions. When a horizontal link is encountered, a new column position is calculated in the middle between the most recent column and that far join, then all the other columns that happen to link to the same join are terminated (they will be drawn to meet the horizontal line, but no further), and the assignment of positions continues down the common trunk, assigning new horizontal (and vertical, but they would be the same) positions. Additional horizontal lines may be formed, until eventually the trunk reaches the root. At that point, PreCalcDisplay can continue with the top of the next column, or else pause and resume later. It only pauses at the top of a column.

The MOS window imaging engine is limited to 4K pixels wide, so if there is a lot of output text, it might exceed the pixel space normally available. I solved this by defining a pane not more than 4K wide, which can slide back and forth within the virtual image space as scrolled, and is redrawn if it scrolls past the available pixels. This makes the scrolling hiccup at those redraw points, but the hesitation is small. There are also packing limits, reached when the absolute pixel offset of an item in the window is more than 32K from the left edge, which is solved by starting each OTrx resource at a given pixel offset (in [2]) so that each horizontal position in that resource is relative to that offset. With fewer than 500 items in each OTrx resource, and maybe four items in an average single column before linking to the left, we might have a hundred columns of perhaps 100 (or fewer) pixels each, well below the 32K limit. A pathologically wide and shallow grammar with very long words and/or a very wide font might hit the limit. StackEmit estimates the worst-case scenario, catches that case, and shortens the resource as needed.
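The relative-offset packing and the back-of-envelope width estimate can be checked with a short sketch. The function names and the exception raised are hypothetical; only the 32K limit, the per-resource base offset, and the 500 / 4 / 100 figures come from the text above.

```python
PACK_LIMIT = 1 << 15  # 32K: positions must stay below this to be packed (from the text)

def pack_x(absolute_x, resource_base):
    """Store a horizontal position relative to its resource's pixel offset."""
    relative = absolute_x - resource_base
    if not 0 <= relative < PACK_LIMIT:
        raise OverflowError("position does not fit; a new OTrx resource is needed")
    return relative

def unpack_x(relative, resource_base):
    """Recover the absolute position from the packed relative value."""
    return resource_base + relative

def worst_case_width(items, items_per_column, column_px):
    """Rough pixel width that one resource's columns can span."""
    columns = -(-items // items_per_column)  # ceiling division
    return columns * column_px

estimate = worst_case_width(500, 4, 100)  # the figures used in the text
```

With those figures the estimate is 12,500 pixels, comfortably under the 32K limit, and a position such as 40,000 still packs as long as its resource starts late enough (for example at a base offset of 36,000).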

After PreCalcDisplay has finished its analysis, it knows the total (virtual) image width, so the scroll offset can be determined for leaving the visible portion of the window at the rightmost edge (its default position) with a suitable setting of the VirtImOff field of the Widget. Any time a part of the virtual image (not in the current pixelated portion) is scrolled into view, BuildImage is called to redraw that part.
 

BuildImage

BuildImage examines the OTrx resource(s) and, using the low-level image-building functions in package ImDlogPkg, constructs an Imag resource, which the MOS drawing engine knows how to convert to pixels. Any image components to the left of the VirtImOff are omitted from the conversion, as are any components to the right of the right edge of the sliding panel as positioned. It turns out that the OTrx resource encodes more image in 1K of integers than can fit in a 1K Imag resource, so the pane width is arbitrarily reduced to 2K pixels (which seems adequate). Real Soon Now I hope to fix the MOS resource manager to accommodate larger resources. So many things to do, so little time to do them.
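The culling step amounts to an interval test against the pane. The sketch below is an assumption-laden illustration, not the BuildImage code: items are modeled as hypothetical `(x, width, label)` tuples, and it assumes that an item straddling a pane edge is kept rather than dropped, which the text does not specify.

```python
def visible_items(items, virt_im_off, pane_width):
    """Keep only items overlapping the pane [virt_im_off, virt_im_off + pane_width)."""
    right_edge = virt_im_off + pane_width
    return [
        (x, width, label)
        for (x, width, label) in items
        if x + width > virt_im_off and x < right_edge
    ]

# Made-up positions; the pane is 2K (2048) pixels wide, starting at pixel 2000.
items = [(0, 50, "John"), (1900, 100, "loves"), (2100, 60, "Mary"), (4200, 40, ".")]
shown = visible_items(items, virt_im_off=2000, pane_width=2048)
```

Only "Mary" survives here: "John" and "loves" end at or before the left edge of the pane, and "." starts beyond its right edge.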
 

First Draft: 2013 July 31