ack/doc/ego/ic/ic2

.NH 2
Representation of complex data structures in a sequential file
.PP
Most programmers are quite used to deal with
complex data structures, such as
arrays, graphs and trees.
There are some particular problems that occur
when storing such a data structure
in a sequential file.
We call data that is kept in
main memory
.UL internal
,as opposed to
.UL external
data
that is kept in a file outside the program.
.sp
We assume a simple data structure of a
scalar type (integer, floating point number)
has some known external representation.
An
.UL array
having elements of a scalar type can be represented
externally easily, by successively
representing its elements.
The external representation may be preceded by a
number, giving the length of the array.
Now, consider a linear, singly linked list,
the elements of which look like:
.DS
record
        data: scalar_type;
        next: pointer_type;
end;
.DE
It is significant to note that the "next"
fields of the elements only have a meaning within
main memory.
The field contains the address of some location in
main memory.
If a list element is written to a file in
some program,
and read by another program,
the element will be allocated at a different
address in main memory.
Hence this address value is completely
useless outside the program.
.sp
One may represent the list by ignoring these "next" fields
and storing the data items in the order they are linked.
The "next" fields are represented \fIimplicitly\fR.
When the file is read again,
the same list can be reconstructed.
In order to know where the external representation of the
list ends,
it may be useful to put the length of
the list in front of it.
.sp
Note that arrays and linear lists have the
same external representation.
.PP
A doubly linked, linear list,
with elements of the type:
.DS
record
        data: scalar_type;
        next,
        previous: pointer_type;
end
.DE
can be represented in precisely the same way.
Both the "next" and the "previous" fields are represented
implicitly.
.PP
Next, consider a binary tree,
the nodes of which have type:
.DS
record
        data: scalar_type;
        left,
        right: pointer_type;
end
.DE
Such a tree can be represented sequentially,
by storing its nodes in some fixed order, e.g. prefix order.
A special null data item may be used to
denote a missing left or right son.
For example, let the scalar type be integer,
and let the null item be 0.
Then the tree of fig. 3.1(a)
can be represented as in fig. 3.1(b).
.DS
.ft 5
                        4
                      /   \e
                    9      12
                  /  \e    /  \e
                12    3   4   6
                     / \e  \e  /
                     8  1  5 1
.ft R

Fig. 3.1(a) A binary tree


.ft 5
4 9 12 0 0 3 8 0 0 1 0 0 12 4 0 5 0 0 6 1 0 0 0
.ft R

Fig. 3.1(b) Its sequential representation
.DE
We are still able to represent the pointer fields ("left"
and "right") implicitly.
.PP
Finally, consider a general
.UL graph
, where each node has a "data" field and
pointer fields,
with no restriction on where they may point to.
Now we're at the end of our tale.
There is no way to represent the pointers implicitly,
like we did with lists and trees.
In order to represent them explicitly,
we use the following scheme.
Every node gets an extra field,
containing some unique number that identifies the node.
We call this number its
.UL id.
A pointer is represented externally as the id of the node
it points to.
When reading the file we use a table that maps
an id to the address of its node.
In general this table will not be completely filled in
until we have read the entire external representation of
the graph and allocated internal memory locations for
every node.
Hence we cannot reconstruct the graph in one scan.
That is, there may be some pointers from node A to B,
where B is placed after A in the sequential file than A.
When we read the node of A we cannot map the id of B
to the address of node B,
as we have not yet allocated node B.
We can overcome this problem if the size
of every node is known in advance.
In this case we can allocate memory for a node
on first reference.
Else, the mapping from id to pointer
cannot be done while reading nodes.
The mapping can be done either in an extra scan
or at every reference to the node.