Entropy Document Reader Compressed Text File Format

Historical note:

                The FRED document reader was written by Simon Cooke, and first appeared on FRED disc magazine, issue 17. It was the first of its kind to display compressed text faster than the equivalent BASIC uncompressed text reader. Since then, two more versions have appeared: one a minor bodge to provide printing facilities for the text reader (when version 1 was written, I didn't have a printer or printer interface, so I had to give it my best guess on how it all worked. I was wrong), and the other a slightly updated token set, cleaned-up file formats, the addition of a "Return to help page" button in place of the "Article menu doesn't exist" message which was put in with the printer fix, and a tidier compressor program to go with it.

                Version 1.2 will be the last one to use this format explicitly; work is in progress on a new, multi-format text reader program (the Entropy Chimera project) which will not only interpret Document Reader files, but also COMET assembler source files, SAM Small C source, Tasword and Outwrite files, MSDOS text files, Hypertext files, SAM BASIC programs, etc etc. It's all just a matter of time.

Document reader data types

                There are three types of file associated with the Document Reader: DCP files, MAG files and JYN files. However, the JYN file type only applies to version 1.2 of the document reader program.

                *.DCP files contain a magic string which identifies them to the SAM system, as well as details of the number of pages contained in the MAG archive, and the page and offset in memory at which each can be found. DCP files are always loaded at 16384. NB: if the magic string is not found when the document reader program is run, then it will abort with an error message.

                *.MAG files contain (for version 1.0) article menu data, giving page numbers at which certain topics held in the document can be found, as well as (for all versions) the actual compressed data stream.

                *.JYN (joined) files are in actuality merely version 1.2 DCP and MAG files concatenated into a single file, in order to reduce the directory space utilised. The document reader program cannot read these files by itself; a short BASIC program stub is needed to arrange the data in memory in a suitable format before calling the reader.

Version 1.0 format: *.DCP file (load to 16384)

Offset  Function                                               Length

0       Magic string "1991Cookie"                              (12)
12      Number of pages present in document                    (1)
15      Page data: page number at which document page starts   (1)
        followed by an offset in the range 0-16383             (2)

 

                The page data repeats until all pages in the document have been accounted for. (NB: the "" character in the magic string is the space character, &20 hex; the unprintable symbol is &7F hex.) The page data forms a 3-byte address within the SAM's memory for the start of each page of text in the document archive.
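Under the layout above, each page record can be read off directly. A minimal Python sketch (the function name is illustrative; reading the 2-byte offset low byte first is an assumption consistent with the Z80's little-endian 16-bit values):

```python
import struct

def parse_dcp(data: bytes):
    """Parse a v1.0 DCP header into a list of (page, offset) pairs.

    Layout assumed from the table above: 12-byte magic string, page
    count at offset 12, and page records from offset 15 onwards, each
    a 1-byte page number followed by a 2-byte little-endian offset in
    the range 0-16383.
    """
    num_pages = data[12]
    entries = []
    pos = 15
    for _ in range(num_pages):
        page = data[pos]
        offset = struct.unpack_from("<H", data, pos + 1)[0]
        entries.append((page, offset))
        pos += 3
    return entries
```

Each (page, offset) pair is the 3-byte address the reader uses to locate the start of a page of text.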

Version 1.0 format: *.MAG file (load to 38233)

Offset  Function                                      Length

0       Address of article page table                 (2)
2       Number of articles in article table           (1)
3       Maximum article name width (in characters)    (1)
4       Start of article name data                    (variable)

                Article name data can be any length, and each article topic is delimited by the last character of the name having bit 7 set. The actual document text starts immediately after the article text in the file, and is pointed to by the DCP file for speed. The article width field is used in rendering menus. The article page table mentioned is a list determining which article topic corresponds to which page of the document. It is worthy of note that the first page in a MAG archive is the help page displayed when the document reader is first loaded, and that it corresponds to page number zero.
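The name-scanning rule above (the last character of each name has bit 7 set) can be sketched as follows; the function name is illustrative:

```python
def parse_article_names(data: bytes, count: int):
    """Split article name data into strings.

    `data` starts at the article name data (offset 4 of the MAG file);
    `count` is the number of articles from offset 2.  Each name's
    final character has bit 7 set to mark its end.
    """
    names, current, pos = [], [], 0
    for _ in range(count):
        while True:
            b = data[pos]
            pos += 1
            current.append(chr(b & 0x7F))   # strip the terminator bit
            if b & 0x80:                    # bit 7 set: end of this name
                names.append("".join(current))
                current = []
                break
    return names, pos   # pos marks where the following data begins
```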

Version 1.1 file format

                The version 1.1 file format is in fact identical to the version 1.0 format (the reason being that it should still be possible to load in old files for printing purposes). The only difference is that where a v1.0 file would nearly always have an associated article menu, v1.1 files usually have the article menu set to zero entries. Thus, the first 4 bytes of a version 1.1 file will nearly always be:

 

38233      Address of article page table:      38237
38235      Number of articles found:               0
38236      Max article descriptor width:           0

 

                There will be no article name entries, or page table entries; thus the address pointer at 38233 is invalid, and the magazine page data itself will start at this address.

                All a reader has to be able to do in order to read both v1.0 and v1.1 files reliably is to determine the number of articles in the article menu, and to ignore the menu if the "number of articles found" field is set to zero. This criterion is not met by the actual v1.0 Document Reader program, which is why it often crashes when attempts are made to view v1.1 files which have no article menu.

Version 1.2 file format: *.DCP files (loads to 16384)

Offset  Function                                               Length

0       Magic string "1992ENTROPY"                             (14)
14      Number of pages present in document                    (1)
15      Page data: page number at which document page starts   (1)
        followed by an offset in the range 0-16383             (2)

 

                The page data repeats until all pages in the document have been accounted for. Version 1.2 files are differentiated from their v1.0 and v1.1 equivalents by the different magic string; this is necessary, as all remnants of the article menu data have been removed from the v1.2 MAG file, and also v1.2 uses a different, slightly more optimised compression token table than versions 1.0 and 1.1 (which both use the same one).

Version 1.2 file format: *.MAG files (loads to 38300)

                Whereas the earlier versions held article data (or remnants thereof) here, all such data has been removed in version 1.2 so as to allow the files to take up less disc space. Thus, the only data to be found in the MAG file is the compressed document text itself, as pointed to by the DCP file.

Version 1.2 file format: *.JYN files

                As mentioned above, JYN files are composed of a DCP file, with a MAG file tacked on immediately afterwards. To extract the DCP and MAG files, it is necessary to load the file at, say, 49152, and to calculate the DCP length (using the number of pages value stored 14 bytes into the file), poke the DCP section into memory at 16384, and then move the rest of the file down in memory to 38300.

                The length of the DCP file is calculated as: (NUMPAGES*3)+18.

                As the file is just a DCP file joined to a MAG file, it can be validity checked by looking for the string "1992ENTROPY" at the start of the file.
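The extraction procedure above can be sketched like so (a hypothetical helper; only the length formula, the offset of the page count, and the magic string come from the text):

```python
def split_jyn(jyn: bytes):
    """Split a v1.2 JYN file back into its DCP and MAG parts.

    The DCP length is (NUMPAGES * 3) + 18, with the page count
    stored 14 bytes into the file, as described above.
    """
    if not jyn.startswith(b"1992ENTROPY"):
        raise ValueError("not a v1.2 JYN file")
    num_pages = jyn[14]
    dcp_len = num_pages * 3 + 18
    dcp, mag = jyn[:dcp_len], jyn[dcp_len:]
    return dcp, mag   # load dcp at 16384, mag at 38300
```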

Text compression method

                Document text in a MAG file is compressed using a combination of run-length compression and tokenised strings. Each page is 1344 bytes long (64 characters per line, 21 lines per page). Data bytes below 128 are passed directly to the output routine, bytes with the value 128 are passed on to the run-length subroutine, and the rest (129-255) are passed on to the detokenisation routine. Arguably, better compression could be provided by allowing tokens to also have the values 0-31, increasing the total number of tokens available from 127 to 159; but as this is not done by the Document Reader, and there are no further incarnations of the original reader program planned, this is a moot point.
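The three-way dispatch can be sketched in Python (an illustrative model of the reader's inner loop, not the actual Z80 code; `tokens` stands for the appropriate version's dictionary as a list of strings indexed by normalised token number):

```python
def decompress_page(data, tokens):
    """Decompress one page of MAG text.

    Following the three rules above: bytes below 128 are literal
    characters, &80 starts a run of spaces (the next byte is the
    count), and 129-255 index the token dictionary (token byte
    minus 129).  A full page decompresses to 1344 characters.
    """
    out = []
    it = iter(data)
    for b in it:
        if b < 128:
            out.append(chr(b))            # literal character
        elif b == 128:
            out.append(" " * next(it))    # run of spaces
        else:
            out.append(tokens[b - 129])   # dictionary token
    return "".join(out)
```

With the v1.0 dictionary (normalised tokens 01 and 02 are "screens" and "screen"), the worked example in the run-length section decompresses to "!AE", sixteen spaces, then "screensscreenA".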

 

Run-Length Compression

                Spaces are run-length compressed. The compressor works by looking for a string of 3 or more spaces in the document text. (In Outwrite and Tasword format text, 64 spaces are used for a blank line, so this can lead to considerable savings.) Thus, whenever a code of &80 hex is found in the compressed text, the next byte is taken, and this is used as a counter to print spaces to the screen. Occurrences of two consecutive spaces are compressed by the tokenising routine.

                For example, if the data in the text stream was (in hexadecimal):

 

21 41 45 80 10 82 83 41

 

                This would print out as:

 

!AE<16 spaces><TOKEN01><TOKEN02>A

 

                The &80 hex is the compressed space character key, and the byte after it, &10, indicates that 16 spaces are to be printed out. The &82 and &83 bytes are compressed tokens; until we have gone over the operation of the detokeniser, I have printed them as <TOKEN01> and <TOKEN02>.

Tokenising Compression

                Tokenising compression works by replacing strings of characters with a reference to a dictionary. This dictionary can then be used to recreate the text. It's similar to shorthand, except that for efficiency we do not use just whole words, but word fragments as well.

                To compress the text, the tokeniser looks through the uncompressed data to see if it matches one of the "words" or tokens in the dictionary. (This is similar to what BASIC does when the editor puts a line into the program. BASIC does it for two reasons; to save space, and to make it much quicker to actually run a program. We're only doing it for space reasons).

                If a token is found, 129 is added to the token's dictionary reference number, and the result takes the place of the equivalent text in the compressed data. Thus, all that is necessary to decompress the tokens is a copy of the appropriate dictionary, using the token data to index that table.

                In the tables below, the dictionary entries are referenced by the normalised (ie with 129 subtracted from them) token numbers; this is to make it easier to write a fast decompressor routine. Each token has bit 7 set on the last character, to mark that the end of the token has been reached.
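Building the token table from raw dictionary data, using the bit-7 end marker described above, might look like this (an illustrative sketch; the function names are not from the reader itself):

```python
def load_dictionary(raw: bytes):
    """Split raw dictionary data into a list of token strings.

    Each token's last character has bit 7 set; tokens are stored in
    normalised order, so list index = token byte minus 129.
    """
    tokens, current = [], []
    for b in raw:
        current.append(chr(b & 0x7F))   # strip the end-marker bit
        if b & 0x80:                    # bit 7 set: end of this token
            tokens.append("".join(current))
            current = []
    return tokens

def detokenise(byte_value: int, tokens):
    """Map a compressed byte (129-255) to its dictionary entry."""
    return tokens[byte_value - 129]
```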

                In the above example text string, the compressed data was:

 

21 41 45 80 10 82 83 41

 

                And the uncompressed string was:

 

!AE<16 spaces><TOKEN01><TOKEN02>A

 

For version 1.0 and version 1.1, we can now write this as:

 

!AE<16 spaces>screensscreenA

 

And for version 1.2, it is now:

 

!AE<16 spaces>ouldouseA

 

The decompressed tokens are "screens"/"screen" for versions 1.0 and 1.1, and "ould"/"ouse" for version 1.2.


Version 1.0 and 1.1 Token Dictionary

 

NB: the normalised token number is the row value plus the column value, both in hex.

       00         01         02         03

00     address    screens    screen     issue
04     memory     screen     don't      SAMCO
08     SAMCo      Coupe      FRED       bytes
0C     data       it's       from       SAM
10     199        code       Code       Data
14     ould       out        had        Coupe
18     SAMCO      SAMCo      The        the
1C     tion       at         empt       199
20     comp       Comp       cons       Cons
24     ...        you        'll        ere
28     You        it         .)         n't
2C     ity        At         199        ing
30     een        and        And        ght
34     mag        pro        oum        ove
38     age        -          'm         's
3C     You        I          ant        ial
40                (          er         ,
44                .          !          ?
48     A          or         ss         ee
4C     ch         sh         un         ly
50     th         Th         To         to
54     ow         qu         Qu         Be
58     be         Up         up         Re
5C     re         en         En         us
60     Us         ed         oo         ."
64     !"         ?"         ;          :
68     )          pe         Pe         ir
6C     Ir         my         pp         I
70     dd         ea         ff         ss
74     it         rr         at         At
78     e          y          ic         <No Token>


Version 1.2 Document Reader Token Dictionary

 

       00         01         02         03

00     you'll     ould       ouse       cons
04     comp       I'll       entr       ight
08                ent        ing        out
0C     ang        cei        ial        ant
10     mag        pro        age        I'm
14     'll        had        n't        ean
18     eem        ove        I'd        een
1C     all        oup        SAM        the
20     The        dis        key        ave
24     opy        oil        air        eer
28     ure        ion        vis        ban
2C     mon        hor        ard        ish
30     nal                   .          ,
34     's         om         sh         ch
38     ew         ng         ic         tr
3C     cr         it         ff         ss
40     ee         oo         ou         ie
44     ei         'm         nt         fl
48     ph         qu         be         up
4C     re         en         us         ed
50     to         ow         rr         ea
54     ar         pe         mu         th
58     Th         ll         ff         In
5C     in         pp         my         I
60     or         on         et         sc
64     ut         ex         ce         ck
68     at         At         A          a
6C     It         is         Is         su
70     Co         er         de         di
74     bi         ey         sp         go
78     aw         ay         il         op
7C     an         oc         id         <No Token>