Apache POI - POIFS - Design Document

POIFS Design Document

This document describes the design of the POIFS system. It is organized into the following sections: Scope, Assumptions, Design Considerations, and Design.

Scope

This document is written as part of an iterative process. As that process is not yet complete, neither is this document.

Assumptions

The design of POIFS is not dependent on the code written for the proof-of-concept prototype POIFS package.

Design Considerations

As usual, the primary considerations in the design of the POIFS system involve the classic space-time tradeoff. In this case, the main consideration is minimizing the memory footprint of POIFS. POIFS may be called upon to create relatively large documents; in a web application server, it may be called upon to create several documents simultaneously; and it will likely coexist with other Serializer systems, competing with those systems for space on the server.

We've addressed the risk of being too slow through a proof-of-concept prototype. This prototype read an existing file, decomposed it into its constituent documents, composed a new POIFS from those documents, wrote the POIFS file back to disk, and verified that the output file, while not necessarily a byte-for-byte image of the input file, could be read by the application that generated the input file. The prototype proved to be quite fast, reading, decomposing, and regenerating a large (300K) file in 2 to 2.5 seconds.

While the POIFS format allows great flexibility in laying out the documents and the other internal data structures, the layout of the filesystem will be kept as simple as possible.

Design

The design of the POIFS is broken down into two parts: discussion of the classes and interfaces, and discussion of how these classes and interfaces will be used to convert an appropriate Java InputStream (such as an XML stream) to a POIFS output stream containing an HSSF document.

Classes and Interfaces

The classes and interfaces used in the POIFS are broken down as follows:

Package                                Contents
net.sourceforge.poi.poifs.storage      Block classes and interfaces
net.sourceforge.poi.poifs.property     Property classes and interfaces
net.sourceforge.poi.poifs.filesystem   Filesystem classes and interfaces
net.sourceforge.poi.util               Utility classes and interfaces

Block Classes and Interfaces

The block classes and interfaces are shown in the following class diagram.

Block Classes and Interfaces

Class/Interface Description
BATBlock The BATBlock class represents a single big block containing 128 BAT entries.
Its _fields array is used to read and write the BAT entries into the _data array.
Its createBATBlocks method is used to create an array of BATBlock instances from an array of int BAT entries.
Its calculateStorageRequirements method calculates the number of BAT blocks necessary to hold the specified number of BAT entries.
BigBlock The BigBlock class is an abstract class representing the common big block of 512 bytes. It implements BlockWritable, trivially delegating the writeBlocks method of BlockWritable to its own abstract writeData method.
BlockWritable The BlockWritable interface defines a single method, writeBlocks, that is used to write an implementation's block data to an OutputStream.
DocumentBlock The DocumentBlock class is used by a Document to hold its raw data. It also retains the number of bytes read, as this is used by the Document class to determine the total size of the data, and is also used internally to determine whether the block was filled by the InputStream or not.
The DocumentBlock constructor is passed an InputStream from which to fill its _data array.
The size method returns the number of bytes read (_bytes_read) when the instance was constructed.
The partiallyRead method returns true if the _data array was not completely filled, which may be interpreted by the Document as having reached the end of file point.
Typical use of the DocumentBlock class is like this:
    while (true) {
        DocumentBlock block = new DocumentBlock(stream);
        blocks.add(block);
        size += block.size();
        if (block.partiallyRead()) {
            break;
        }
    }
HeaderBlock The HeaderBlock class is used to contain the data found in a POIFS header.
Its IntegerField members are used to read and write the appropriate entries into the _data array.
Its setBATBlocks, setPropertyStart, and setXBATStart methods are used to set the appropriate fields in the _data array.
The calculateXBATStorageRequirements method is used to determine how many XBAT blocks are necessary to accommodate the specified number of BAT blocks.
PropertyBlock The PropertyBlock class is used to contain Property instances for the PropertyTable class.
It contains an array, _properties, of 4 Property instances, which together comprise the 512 bytes of a BigBlock.
The createPropertyBlockArray method is used to convert a List of Property instances into an array of PropertyBlock instances. The number of Property instances is rounded up to a multiple of 4 by creating empty anonymous inner class extensions of Property.
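
The block-reading loop described for DocumentBlock can be tried out in isolation. The following is a minimal sketch, not the POIFS implementation: SketchDocumentBlock and SketchDocumentReader are hypothetical names, and only the 512-byte block size and the size and partiallyRead methods come from the description above. Wrapping IOException in an unchecked exception is a simplification made for this sketch.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for DocumentBlock; it only mirrors the behavior described above.
class SketchDocumentBlock {
    static final int BIG_BLOCK_SIZE = 512;

    private final byte[] data = new byte[BIG_BLOCK_SIZE];
    private int bytesRead = 0;

    SketchDocumentBlock(InputStream stream) {
        try {
            // A single read() may return fewer bytes than requested,
            // so keep reading until the block is full or the stream ends.
            while (bytesRead < BIG_BLOCK_SIZE) {
                int count = stream.read(data, bytesRead, BIG_BLOCK_SIZE - bytesRead);
                if (count < 0) {
                    break; // end of stream
                }
                bytesRead += count;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Number of bytes read when the instance was constructed.
    int size() {
        return bytesRead;
    }

    // True if the stream ended before the block was filled.
    boolean partiallyRead() {
        return bytesRead != BIG_BLOCK_SIZE;
    }
}

class SketchDocumentReader {
    // Fills 'blocks' from the stream and returns the total number of bytes read.
    static int readAll(InputStream stream, List<SketchDocumentBlock> blocks) {
        int size = 0;
        while (true) {
            SketchDocumentBlock block = new SketchDocumentBlock(stream);
            blocks.add(block);
            size += block.size();
            if (block.partiallyRead()) {
                break;
            }
        }
        return size;
    }

    public static void main(String[] args) {
        List<SketchDocumentBlock> blocks = new ArrayList<>();
        int size = readAll(new ByteArrayInputStream(new byte[1300]), blocks);
        System.out.println(size + " bytes in " + blocks.size() + " blocks");
    }
}
```

In this sketch, an input whose length is an exact multiple of 512 bytes ends with one extra, empty block, because the end of the stream is only detected by a short read.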

Property Classes and Interfaces

The property classes and interfaces are shown in the following class diagram.

Property Classes and Interfaces

Class/Interface Description
Directory The Directory interface is implemented by the RootProperty class. It is not strictly necessary for the initial POIFS implementation, but when the POIFS supports directory elements, this interface will be more widely implemented, and so is included in the design at this point to ease the eventual support of directory elements.
It has two methods: getChildren, which returns an Iterator of Property instances, and addChild, which allows the caller to add another Property instance to the Directory's children.
DocumentProperty The DocumentProperty class is a trivial extension of Property and is used by Document to keep track of its associated entry in the PropertyTable.
Its constructor takes a name and the document size, on the assumption that the Document will not create a DocumentProperty until after it has created the storage for the document data and therefore knows how much data there is.
File The File interface specifies the behavior of reading and writing the next and previous child fields of a Property.
Property The Property class is an abstract class that defines the basic data structure of an element of the Property Table.
Its ByteField, ShortField, and IntegerField members are used to read and write data into the appropriate locations in the _raw_data array.
The _index member is used to hold a Property instance's index in the List of Property instances maintained by PropertyTable, which is used to populate the child property of parent Directory properties and the next property and previous property of sibling File properties.
The _name, _next_file, and _previous_file members are used to help fill the appropriate fields of the _raw_data array.
Setters are provided for some of the fields (name, property type, node color, child property, size, index, start block), as well as a few getters (index, child property).
The preWrite method is abstract and is used by the owning PropertyTable to iterate through its Property instances and prepare each for writing.
The shouldUseSmallBlocks method returns true if the Property's size is sufficiently small - how small is none of the caller's business.
PropertyBlock See the description in PropertyBlock.
PropertyTable The PropertyTable class holds all of the DocumentProperty instances and the RootProperty instance for a Filesystem instance.
It maintains a List of its Property instances (_properties), and when prepared to write its data by a call to preWrite, it gets and holds an array of PropertyBlock instances (_blocks).
It also maintains its start block in its _start_block member.
It has three methods: getRoot, which returns the RootProperty as an implementation of Directory; addProperty, which adds a Property; and getStartBlock, which returns its start block.
RootProperty The RootProperty class acts as the Directory for all of the DocumentProperty instances. As such, it is more of a pure directory entry than a proper root entry in the Property Table, but the initial POIFS implementation does not warrant the additional complexity of a full-blown root entry, and so it is not modeled in this design.
It maintains a List of its children, _children, in order to perform its directory-oriented duties.

Filesystem Classes and Interfaces

The filesystem classes and interfaces are shown in the following class diagram.

Filesystem Classes and Interfaces

Class/Interface Description
Filesystem The Filesystem class is the top-level class that manages the creation of a POIFS document.
It maintains a PropertyTable instance in its _property_table member, a HeaderBlock instance in its _header_block member, and a List of its Document instances in its _documents member.
It provides a method for a client to create a document (createDocument), and a method to write the Filesystem to an OutputStream (writeFilesystem).
BATBlock See the description in BATBlock.
BATManaged The BATManaged interface defines common behavior for objects whose location in the written file is managed by the Block Allocation Table.
It defines methods to get a count of the implementation's BigBlock instances (countBlocks), and to set an implementation's start block (setStartBlock).
BlockAllocationTable The BlockAllocationTable is an implementation of the POIFS Block Allocation Table. It is only created when the Filesystem is about to be written to an OutputStream.
It contains an IntList of block numbers for all of the BATManaged implementations owned by the Filesystem, _entries, which is filled by calls to allocateSpace.
It fills its array, _blocks, of BATBlock instances when its createBATBlocks method is called. This method has to take into account its own storage requirements, as well as those of the XBAT blocks, and so calls BATBlock.calculateStorageRequirements and HeaderBlock.calculateXBATStorageRequirements repeatedly until the counts returned by those methods stabilize.
The countBlocks method returns the number of BATBlock instances created by the preceding call to createBATBlocks.
BlockWritable See the description in BlockWritable.
Document The Document class is used to contain a document, such as an HSSF workbook.
It has its own DocumentProperty (_property) and stores its data in a collection of DocumentBlock instances (_blocks).
It has a method, getDocumentProperty, to get its DocumentProperty.
DocumentBlock See the description in DocumentBlock.
DocumentProperty See the description in DocumentProperty.
HeaderBlock See the description in HeaderBlock.
PropertyTable See the description in PropertyTable.
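
The "repeat until the counts stabilize" calculation described for BlockAllocationTable can be sketched as a small fixed-point loop. This is an illustration under stated assumptions, not the POIFS code: SketchBatSizing and its method names are hypothetical, the 128 entries per BAT block come from the BATBlock description, and the figures of 109 BAT block references held in the header and 127 per XBAT block are assumptions about the file format that this document does not state.

```java
// Hypothetical helper; the names and two of the constants are assumptions, not the POIFS API.
class SketchBatSizing {
    static final int ENTRIES_PER_BAT_BLOCK = 128;   // from the BATBlock description
    static final int BAT_REFS_IN_HEADER = 109;      // assumption about the header layout
    static final int BAT_REFS_PER_XBAT_BLOCK = 127; // assumption: one slot chains to the next XBAT block

    // Number of BAT blocks needed to hold the given number of BAT entries.
    static int batBlocksFor(int entryCount) {
        return (entryCount + ENTRIES_PER_BAT_BLOCK - 1) / ENTRIES_PER_BAT_BLOCK;
    }

    // Number of XBAT blocks needed for the BAT blocks the header cannot reference directly.
    static int xbatBlocksFor(int batBlockCount) {
        int overflow = Math.max(0, batBlockCount - BAT_REFS_IN_HEADER);
        return (overflow + BAT_REFS_PER_XBAT_BLOCK - 1) / BAT_REFS_PER_XBAT_BLOCK;
    }

    // Returns {batBlocks, xbatBlocks} for the given number of document and property blocks.
    // The BAT must also describe the BAT and XBAT blocks themselves, so the counts
    // are recomputed until neither changes.
    static int[] stabilize(int managedBlockCount) {
        int batBlocks = 0;
        int xbatBlocks = 0;
        while (true) {
            int newBat = batBlocksFor(managedBlockCount + batBlocks + xbatBlocks);
            int newXbat = xbatBlocksFor(newBat);
            if (newBat == batBlocks && newXbat == xbatBlocks) {
                return new int[] { batBlocks, xbatBlocks };
            }
            batBlocks = newBat;
            xbatBlocks = newXbat;
        }
    }

    public static void main(String[] args) {
        int[] counts = stabilize(128);
        System.out.println(counts[0] + " BAT blocks, " + counts[1] + " XBAT blocks");
    }
}
```

With 128 managed blocks, a single BAT block is not enough in this sketch, because the BAT block itself needs an entry; the loop settles on two BAT blocks and no XBAT blocks.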

Utility Classes and Interfaces

The utility classes and interfaces are shown in the following class diagram.

Utility Classes and Interfaces

Class/Interface Description
BitField The BitField class is used primarily by HSSF code to manage bit-mapped fields of HSSF records. It is not likely to be used in the POIFS code itself and is only included here for the sake of complete documentation of the POI utility classes.
ByteField The ByteField class is an implementation of FixedField for the purpose of managing reading and writing to a byte-wide field in an array of bytes.
FixedField The FixedField interface defines a set of methods for reading a field from an array of bytes or from an InputStream, and for writing a field to an array of bytes. Implementations typically require an offset in their constructors that, for the purposes of reading and writing to an array of bytes, makes sure that the correct bytes in the array are read or written.
HexDump The HexDump class is a debugging class that can be used to dump an array of bytes to an OutputStream. The static method dump takes an array of bytes, a long offset that is used to label the output, an open OutputStream, and an int index that specifies the starting index within the array of bytes.
The data is displayed 16 bytes per line, with each byte displayed in hexadecimal format and again in printable form, if possible (a byte is considered printable if its value is in the range of 32 ... 126).
Here is an example of a small array of bytes with an offset of 0x110:
00000110 C8 00 00 00 FF 7F 90 01 00 00 00 00 00 00 05 01 ................
00000120 41 00 72 00 69 00 61 00 6C 00 A.r.i.a.l.
IntegerField The IntegerField class is an implementation of FixedField for the purpose of managing reading and writing to an integer-wide field in an array of bytes.
IntList The IntList class is a work-around for functionality missing in Java (see https://developer.java.sun.com/developer/bugParade/bugs/4487555.html for details); it is a simple growable array of ints that gets around the requirement of wrapping and unwrapping ints in Integer instances in order to use the java.util.List interface.
IntList mimics the functionality of the java.util.List interface as much as possible.
LittleEndian The LittleEndian class provides a set of static methods for reading and writing shorts, ints, longs, and doubles in and out of byte arrays, and out of InputStreams, preserving the Intel byte ordering and encoding of these values.
LittleEndianConsts The LittleEndianConsts interface defines the width of a short, int, long, and double as stored by Intel processors.
LongField The LongField class is an implementation of FixedField for the purpose of managing reading and writing to a long-wide field in an array of bytes.
ShortField The ShortField class is an implementation of FixedField for the purpose of managing reading and writing to a short-wide field in an array of bytes.
ShortList The ShortList class is a work-around for functionality missing in Java (see https://developer.java.sun.com/developer/bugParade/bugs/4487555.html for details); it is a simple growable array of shorts that gets around the requirement of wrapping and unwrapping shorts in Short instances in order to use the java.util.List interface.
ShortList mimics the functionality of the java.util.List interface as much as possible.
StringUtil The StringUtil class manages the processing of Unicode strings.
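
To make the relationship between LittleEndian and the FixedField implementations concrete, here is a minimal sketch of an int-wide field bound to a fixed offset in a byte array. SketchLittleEndian and SketchIntegerField are hypothetical names, and their method signatures are assumptions chosen for this example rather than the signatures of the POI utility classes.

```java
// Hypothetical sketch of the LittleEndian and FixedField ideas; not the POI utility classes.
class SketchLittleEndian {
    // Writes a 4-byte int at the offset, least significant byte first.
    static void putInt(byte[] data, int offset, int value) {
        data[offset] = (byte) value;
        data[offset + 1] = (byte) (value >>> 8);
        data[offset + 2] = (byte) (value >>> 16);
        data[offset + 3] = (byte) (value >>> 24);
    }

    // Reads a 4-byte int stored least significant byte first.
    static int getInt(byte[] data, int offset) {
        return (data[offset] & 0xFF)
                | ((data[offset + 1] & 0xFF) << 8)
                | ((data[offset + 2] & 0xFF) << 16)
                | ((data[offset + 3] & 0xFF) << 24);
    }
}

// A field that always reads and writes the same four bytes of an array.
class SketchIntegerField {
    private final int offset;
    private int value;

    SketchIntegerField(int offset) {
        this.offset = offset;
    }

    // Stores the value and writes it through to the array.
    void set(int newValue, byte[] data) {
        value = newValue;
        SketchLittleEndian.putInt(data, offset, value);
    }

    // Refreshes the value from the array.
    void readFromBytes(byte[] data) {
        value = SketchLittleEndian.getInt(data, offset);
    }

    int get() {
        return value;
    }

    public static void main(String[] args) {
        byte[] data = new byte[8];
        SketchIntegerField field = new SketchIntegerField(4);
        field.set(0x12345678, data);
        System.out.println(Integer.toHexString(data[4] & 0xFF)); // low byte is stored first
    }
}
```

Read this way, the first four bytes of the HexDump example above (C8 00 00 00) decode to the int value 200.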

Scenarios

This section describes the scenarios of how the POIFS classes and interfaces will be used to convert an appropriate XML stream to a POIFS output stream containing an HSSF document.

It is broken down as suggested by the following scenario diagram:

POIFS LifeCycle

Step Description
1 The Filesystem is created by the client application.
2 The client application tells the Filesystem to create a document, providing an InputStream and the name of the document. This may be repeated several times.
3 The client application asks the Filesystem to write its data to an OutputStream.

Initialization

Initialization of the POIFS system is shown in the following scenario diagram:

Initialization

Step Description
1 The Filesystem object, which is created for each request to convert an appropriate XML stream to a POIFS output stream containing an HSSF document, creates its PropertyTable.
2 The PropertyTable creates its RootProperty instance, making the RootProperty the first Property in its List of Property instances.
3 The Filesystem creates its HeaderBlock instance. It should be noted that the decision to create the HeaderBlock at Filesystem initialization is arbitrary; creation of the HeaderBlock could easily and harmlessly be postponed to the appropriate moment in writing the filesystem.

Creating a Document

Creating and adding a document to a POIFS system is shown in the following scenario diagram:

Add Document

Step Description
1 The Filesystem instance creates a new Document instance. It will store the newly created Document in a List of BATManaged instances.
2 The Document reads data from the provided InputStream, storing the data in DocumentBlock instances. It keeps track of the byte count as it reads the data.
3 The Document creates a DocumentProperty to keep track of its property data. The byte count is stored in the newly created DocumentProperty instance.
4 The Filesystem requests the newly created DocumentProperty from the newly created Document instance.
5 The Filesystem sends the newly created DocumentProperty to the Filesystem's PropertyTable so that the PropertyTable can add the DocumentProperty to its List of Property instances.
6 The Filesystem gets the RootProperty from its PropertyTable.
7 The Filesystem adds the newly created DocumentProperty to the RootProperty.

Although typical deployment of the POIFS system will only entail adding a single Document (the workbook) to the Filesystem, there is nothing in the design to prevent multiple Documents from being added to the Filesystem. This flexibility can be employed to write summary information document(s) in addition to the workbook.
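
The initialization and document-creation steps above can be walked through with a handful of stand-in classes. The following is a rough sketch only: every Demo class is hypothetical, block storage is replaced by a plain byte array, the root's name is a placeholder, and error handling is reduced to an unchecked exception.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Every class here is a hypothetical stand-in that mirrors a role described above.
class DemoProperty {
    final String name;
    final int size;

    DemoProperty(String name, int size) {
        this.name = name;
        this.size = size;
    }
}

class DemoRootProperty extends DemoProperty {
    final List<DemoProperty> children = new ArrayList<>();

    DemoRootProperty() {
        super("root", 0); // placeholder name chosen for this sketch
    }

    void addChild(DemoProperty child) {
        children.add(child);
    }
}

class DemoPropertyTable {
    final List<DemoProperty> properties = new ArrayList<>();
    private final DemoRootProperty root = new DemoRootProperty();

    DemoPropertyTable() {
        properties.add(root); // Initialization step 2: the root is the first Property
    }

    DemoRootProperty getRoot() {
        return root;
    }

    void addProperty(DemoProperty property) {
        properties.add(property);
    }
}

class DemoDocument {
    private final byte[] data;
    private final DemoProperty property;

    DemoDocument(String name, InputStream stream) {
        // Add Document step 2: read the stream, keeping track of the byte count.
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        byte[] chunk = new byte[512];
        try {
            int count;
            while ((count = stream.read(chunk)) != -1) {
                buffer.write(chunk, 0, count);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        data = buffer.toByteArray();
        // Add Document step 3: the property records the byte count.
        property = new DemoProperty(name, data.length);
    }

    DemoProperty getDocumentProperty() {
        return property;
    }
}

class DemoFilesystem {
    final DemoPropertyTable propertyTable = new DemoPropertyTable(); // Initialization step 1
    final List<DemoDocument> documents = new ArrayList<>();

    DemoDocument createDocument(InputStream stream, String name) {
        DemoDocument document = new DemoDocument(name, stream);   // steps 1 to 3
        documents.add(document);
        DemoProperty property = document.getDocumentProperty();  // step 4
        propertyTable.addProperty(property);                     // step 5
        propertyTable.getRoot().addChild(property);              // steps 6 and 7
        return document;
    }

    public static void main(String[] args) {
        DemoFilesystem fs = new DemoFilesystem();
        fs.createDocument(new ByteArrayInputStream(new byte[700]), "Workbook");
        System.out.println(fs.propertyTable.properties.size() + " properties");
    }
}
```

Calling createDocument a second time with a different name adds a second child to the root, which is how a summary information document could sit alongside the workbook.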

Writing the Filesystem

Writing the filesystem is shown in the following scenario diagram:

Writing the Filesystem

Step Description
1 The Filesystem adds the PropertyTable to its List of BATManaged instances and calls the PropertyTable's preWrite method. The action taken by the PropertyTable is shown in the PropertyTable preWrite scenario diagram.
2 The Filesystem creates the BlockAllocationTable.
3 The Filesystem gets the block count from the BATManaged instance. Steps 3 through 5 are repeated for each BATManaged instance in the Filesystem's List of BATManaged instances (i.e., the Documents, in order of their addition to the Filesystem, followed by the PropertyTable).
4 The Filesystem sends the block count to the BlockAllocationTable, which adds the appropriate entries to its IntList of entries, returning the starting block for the newly added entries.
5 The Filesystem gives the start block number to the BATManaged instance. If the BATManaged instance is a Document, it sets the start block field in its DocumentProperty.
6 The Filesystem tells the BlockAllocationTable to create its BATBlock instances.
7 The Filesystem gives the BAT information to the HeaderBlock so that it can set its BAT fields and, if necessary, create XBAT blocks.
8 If the filesystem is unusually large (over 7MB), the HeaderBlock will create XBAT blocks to contain the BAT data that it cannot hold directly. In this case, the Filesystem tells the HeaderBlock where those additional blocks will be stored.
9 The Filesystem gives the PropertyTable start block to the HeaderBlock.
10 The Filesystem tells the BlockWritable instance to write its blocks to the provided OutputStream.
This step is repeated for each BlockWritable instance, in this order:
  1. The HeaderBlock.
  2. Each Document, in the order in which it was added to the Filesystem.
  3. The PropertyTable.
  4. The BlockAllocationTable.
  5. The XBAT blocks created by the HeaderBlock, if any.
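
The allocation loop in steps 3 through 5 can be illustrated with a small sketch of allocateSpace. This is one plausible reading, not the POIFS implementation: SketchBlockAllocationTable is a hypothetical name, a List of Integer stands in for IntList, and the entry layout (each entry holds the number of the next block, with -2 marking the end of a chain) is an assumption that this document does not spell out.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of block allocation; the chain layout is an assumption.
class SketchBlockAllocationTable {
    static final int END_OF_CHAIN = -2; // assumed marker value, not taken from this document

    private final List<Integer> entries = new ArrayList<>();

    // Appends one entry per block and returns the number of the first block.
    // A request for zero blocks adds no entries in this sketch.
    int allocateSpace(int blockCount) {
        int startBlock = entries.size();
        for (int i = 0; i < blockCount; i++) {
            boolean last = (i == blockCount - 1);
            entries.add(last ? END_OF_CHAIN : startBlock + i + 1);
        }
        return startBlock;
    }

    List<Integer> getEntries() {
        return entries;
    }

    public static void main(String[] args) {
        SketchBlockAllocationTable bat = new SketchBlockAllocationTable();
        // Documents first, in order of addition, followed by the PropertyTable.
        int firstDocumentStart = bat.allocateSpace(3);
        int secondDocumentStart = bat.allocateSpace(2);
        int propertyTableStart = bat.allocateSpace(1);
        System.out.println(firstDocumentStart + " " + secondDocumentStart + " " + propertyTableStart);
        System.out.println(bat.getEntries());
    }
}
```

Because blocks are handed out in the same order in which they are later written, each start block returned here is simply the running total of blocks allocated so far.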

PropertyTable preWrite scenario diagram

PropertyTable preWrite scenario diagram

Step Description
1 The PropertyTable calls setIndex for each of its Property instances, so that each Property now knows its index within the PropertyTable's List of Property instances.
2 The PropertyTable requests the PropertyBlock class to create an array of PropertyBlock instances.
3 The PropertyBlock calculates the number of empty Property instances it needs to create and creates them. The algorithm for the number to create is:
block_count = (properties.size() + 3) / 4;
emptyPropertiesNeeded = (block_count * 4) - properties.size();
4 The PropertyBlock creates the required number of PropertyBlock instances from the List of Property instances, including the newly created empty Property instances.
5 The PropertyTable calls preWrite on each of its Property instances. For DocumentProperty instances, this call is a no-op. For the RootProperty, the action taken is shown in the RootProperty preWrite scenario diagram.
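
The padding arithmetic in step 3 is easy to check with a few values. The sketch below wraps the two formulas in a hypothetical SketchPropertyPadding class; only the formulas themselves come from the table above.

```java
// Hypothetical wrapper around the two formulas given in step 3.
class SketchPropertyPadding {
    static final int PROPERTIES_PER_BLOCK = 4;

    // Number of PropertyBlock instances needed, rounding up to whole blocks.
    static int blockCount(int propertyCount) {
        return (propertyCount + PROPERTIES_PER_BLOCK - 1) / PROPERTIES_PER_BLOCK;
    }

    // Number of empty Property instances needed to fill the last block.
    static int emptyPropertiesNeeded(int propertyCount) {
        return blockCount(propertyCount) * PROPERTIES_PER_BLOCK - propertyCount;
    }

    public static void main(String[] args) {
        for (int count = 1; count <= 5; count++) {
            System.out.println(count + " properties: " + blockCount(count)
                    + " block(s), " + emptyPropertiesNeeded(count) + " empty");
        }
    }
}
```

For example, a root plus four documents gives five properties, which need two blocks and three empty Property instances.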

RootProperty preWrite scenario diagram

RootProperty preWrite scenario diagram

Step Description
1 The RootProperty sets its child property with the index of the child Property that is first in its List of children.
2 The RootProperty sets its child's next property field with the index of the child's next sibling in the RootProperty's List of children. If the child is the last in the List, its next property field is set to -1. Steps 2 and 3 are repeated for each File in the RootProperty's List of children.
3 The RootProperty sets its child's previous property field with a value of -1.
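
The linking performed in these three steps can be sketched as follows. SketchEntry and SketchRootLinker are hypothetical names, the fields are simplified to plain ints, and the handling of a root with no children (child left at -1) is an assumption of this sketch.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified property entry holding only the fields used in linking.
class SketchEntry {
    final int index;   // position in the PropertyTable's List of Property instances
    int child = -1;
    int next = -1;
    int previous = -1;

    SketchEntry(int index) {
        this.index = index;
    }
}

class SketchRootLinker {
    // Links the root to its first child and each child to its next sibling.
    static void preWrite(SketchEntry root, List<SketchEntry> children) {
        if (children.isEmpty()) {
            root.child = -1; // assumption: no children means no child index
            return;
        }
        root.child = children.get(0).index;                       // step 1
        for (int i = 0; i < children.size(); i++) {
            SketchEntry current = children.get(i);
            boolean last = (i == children.size() - 1);
            current.next = last ? -1 : children.get(i + 1).index; // step 2
            current.previous = -1;                                // step 3
        }
    }

    public static void main(String[] args) {
        SketchEntry root = new SketchEntry(0);
        List<SketchEntry> children = new ArrayList<>();
        children.add(new SketchEntry(1));
        children.add(new SketchEntry(2));
        preWrite(root, children);
        System.out.println(root.child + " " + children.get(0).next + " " + children.get(1).next);
    }
}
```

The result is a simple forward chain: the root points at its first child, and each child points only at the sibling that follows it.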