POI-HPBF - Java API To Access Microsoft Publisher Format Files

Overview

HPBF is the POI Project's pure Java implementation of the Publisher file format.

HPBF is at an early stage while we work out the file format. So far, we have basic text extraction support and can read some parts of the file. Writing is not yet supported, because we have not been able to make sense of the Contents stream, which we think holds many offsets to other parts of the file.

Our initial aims are to produce a text extractor for the format (now done) and to extract hyperlinks from within the document (partly supported). Additional low-level code to process the file format may follow if demand and developer interest warrant it.

Text extraction is available via the org.apache.poi.hpbf.extractor.PublisherTextExtractor class.
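
As a minimal sketch of how the extractor might be used, the example below reads a Publisher file and prints whatever text HPBF can recover. The class name PubTextDump, the helper extractText and the file name sample.pub are hypothetical and chosen only for illustration. The sketch assumes the poi-scratchpad jar is on the classpath and relies on the InputStream constructor, setHyperlinksByDefault and getText as found in recent POI releases.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

import org.apache.poi.hpbf.extractor.PublisherTextExtractor;

public class PubTextDump {

    // Hypothetical helper: returns the text HPBF can recover from one .pub file.
    static String extractText(Path pubFile, boolean withHyperlinks) throws IOException {
        // Both the stream and the extractor are closed when the block exits.
        try (InputStream in = Files.newInputStream(pubFile);
             PublisherTextExtractor extractor = new PublisherTextExtractor(in)) {
            // Hyperlink extraction is only partly supported, so it is opt-in here.
            extractor.setHyperlinksByDefault(withHyperlinks);
            return extractor.getText();
        }
    }

    public static void main(String[] args) throws IOException {
        // "sample.pub" is a placeholder; pass a real path as the first argument.
        Path pubFile = Paths.get(args.length > 0 ? args[0] : "sample.pub");
        System.out.println(extractText(pubFile, false));
    }
}
```

Since there is no usermodel layer, the result is plain text only; layout, fonts and images are not available through this route.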

At this time, there is no usermodel API or similar. There is only low-level support for certain parts of the file, and by no means all of it.

Our current understanding of the file format is documented here.

As of 2017, we are unaware of a public format specification for Microsoft Publisher .pub files. This format was not included in the Microsoft Open Specifications Promise with the rest of the Microsoft Office file formats. As of 2009 and 2016, Microsoft had no plans to document the .pub file format. If this changes in the future, a spec may be published as part of the Microsoft Office File Format Open Specification Technical Documentation.

Note
This code currently lives in the scratchpad area of the POI SVN repository. To use this component, ensure you have the Scratchpad Jar on your classpath, or a dependency defined on the poi-scratchpad artifact - the main POI jar is not enough! See the POI Components Map for more details.
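
For Maven builds, the dependency might be declared roughly as follows. This is a sketch only: the poi.version property is an assumption and should be defined to match the version of the other POI artifacts in your build.

```xml
<dependency>
  <groupId>org.apache.poi</groupId>
  <artifactId>poi-scratchpad</artifactId>
  <!-- Assumption: poi.version is a property you define to match your other POI artifacts -->
  <version>${poi.version}</version>
</dependency>
```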

by Nick Burch