Apache POI - Component Overview
Apache POI Project Components
The Apache POI project is the master project for developing pure Java ports of file formats based on Microsoft's OLE 2 Compound Document Format. OLE 2 Compound Document Format is used by Microsoft Office Documents, as well as by programs using MFC property sets to serialize their document objects.
Apache POI is also the master project for developing pure Java ports of file formats based on Office Open XML (OOXML). OOXML is part of an ECMA / ISO standardisation effort. The standard's documentation is quite large, but you can normally find the part you need without too much effort. The ECMA-376 standard is publicly available, and is also covered by the Microsoft Open Specification Promise (OSP).
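As a rough illustration of the difference between the two families of formats, the sketch below guesses the container type of a file from its first bytes. It is not part of POI: the class ContainerSniffer and its methods are hypothetical names chosen for this example. It relies only on the OLE 2 header signature and the ZIP local file header signature (OOXML files are ZIP packages). A ZIP signature alone does not prove that a file is OOXML, so the result should be treated as a hint.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;

/** Hypothetical helper (not a POI class): guesses the container format from leading bytes. */
class ContainerSniffer {
    enum Container { OLE2, ZIP_BASED, UNKNOWN }

    // Signature found at the start of an OLE 2 Compound Document.
    private static final byte[] OLE2_MAGIC = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };

    // ZIP local file header signature "PK\003\004"; OOXML packages are ZIP archives.
    private static final byte[] ZIP_MAGIC = { 'P', 'K', 3, 4 };

    /** Classifies a buffer holding the first bytes of a file. */
    static Container sniff(byte[] head) {
        if (startsWith(head, OLE2_MAGIC)) return Container.OLE2;
        if (startsWith(head, ZIP_MAGIC)) return Container.ZIP_BASED;
        return Container.UNKNOWN;
    }

    /** Reads up to eight bytes from the file and classifies them. */
    static Container sniff(Path file) throws IOException {
        byte[] buf = new byte[8];
        int filled = 0;
        try (InputStream in = Files.newInputStream(file)) {
            while (filled < buf.length) {
                int n = in.read(buf, filled, buf.length - filled);
                if (n < 0) break; // file is shorter than eight bytes
                filled += n;
            }
        }
        return sniff(Arrays.copyOf(buf, filled));
    }

    private static boolean startsWith(byte[] data, byte[] prefix) {
        return data.length >= prefix.length
            && Arrays.equals(Arrays.copyOf(data, prefix.length), prefix);
    }

    public static void main(String[] args) throws IOException {
        for (String arg : args) {
            System.out.println(arg + ": " + sniff(Paths.get(arg)));
        }
    }
}
```

A file reported as OLE2 would be a candidate for the POIFS-based components, while a ZIP_BASED file would be a candidate for the OOXML components.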
POIFS for OLE 2 Documents
POIFS is the oldest and most stable part of POI. It is our port of the OLE 2 Compound Document Format to pure Java. It supports both read and write functionality. All of our components for the binary (non-XML) Microsoft Office formats ultimately rely on it by definition. Please see the POIFS project page for more information.
HSSF and XSSF for Excel Documents
HSSF is our port of the Microsoft Excel 97 (-2003) file format (BIFF8) to pure Java. XSSF is our port of the Microsoft Excel XML (2007+) file format (OOXML) to pure Java. SS is a package that provides common support for both formats with a common API. Both HSSF and XSSF support reading and writing. Please see the HSSF+XSSF project page for more information.
HWPF and XWPF for Word Documents
HWPF is our port of the Microsoft Word 97 (-2003) file format to pure Java. It supports reading and limited writing, and also provides simple text extraction for the older Word 6 and Word 95 formats. This component remains in the early stages of development, but it can already read and write simple files. Please see the HWPF project page for more information.
We are also working on XWPF, for the WordprocessingML (2007+) format from the OOXML specification. It provides read and write support for simpler files, along with text extraction capabilities.
HSLF and XSLF for PowerPoint Documents
HSLF is our port of the Microsoft PowerPoint 97 (-2003) file format to pure Java. It supports reading and writing. Please see the HSLF project page for more information.
We are also working on XSLF, for the PresentationML (2007+) format from the OOXML specification.
HPSF for OLE 2 Document Properties
HPSF is our port of the OLE 2 property set format to pure Java. Property sets are mostly used to store a document's properties (title, author, date of last modification, etc.), but they can be used for application-specific purposes as well.
HPSF supports both reading and writing of properties.
Please see the HPSF project page for more information.
HDGF and XDGF for Visio Documents
HDGF is our port of the Microsoft Visio 97(-2003) file format to pure Java. It currently only supports reading at a very low level, and simple text extraction. Please see the HDGF / Diagram project page for more information.
XDGF is our port of the Microsoft Visio XML (.vsdx) file format to pure Java. It has slightly more support than HDGF. Please see the XDGF / Diagram project page for more information.
HPBF for Publisher Documents
HPBF is our port of the Microsoft Publisher 98(-2007) file format to pure Java. It currently only supports reading at a low level for around half of the file parts, and simple text extraction. Please see the HPBF project page for more information.
HMEF for TNEF (winmail.dat) Outlook Attachments
HMEF is our port of the Microsoft TNEF (Transport Neutral Encapsulation Format) file format to pure Java. TNEF is sometimes used by Outlook for encoding the message, and will typically come through as winmail.dat. HMEF currently only supports reading at a low level, but we hope to add text and attachment extraction. Please see the HMEF project page for more information.
HSMF for Outlook Messages
HSMF is our port of the Microsoft Outlook message file format to pure Java. It currently only supports reading some of the textual content of MSG files, and some attachments. Further support and documentation are coming in slowly. For now, users are advised to consult the unit tests for example use. Please see the HSMF project page for more information.
Microsoft has recently added the Outlook file format to its OSP. More information is now available, which makes implementing this API an easier task.
The Apache POI distribution supports many document file formats. This support is provided in several Jar files, and not all of the Jars are needed for every format. The following tables show the relationships between POI components, Maven repository tags, and the project's Jar files.
|Component||Application type||Maven artifactId||Notes|
|POIFS||OLE2 Filesystem||poi||Required to work with OLE2 / POIFS based files|
|HPSF||OLE2 Property Sets||poi|
|HSSF||Excel XLS||poi||For HSSF only, if common SS is needed see below|
|DDF||Escher common drawings||poi|
|OpenXML4J||OOXML||poi-ooxml plus either poi-ooxml-schemas or ooxml-schemas and ooxml-security||See notes below for differences between these options|
|Common SL||PowerPoint PPT and PPTX||poi-scratchpad and poi-ooxml||SL code is in the core POI jar, but implementations are in poi-scratchpad and poi-ooxml.|
|Common SS||Excel XLS and XLSX||poi-ooxml||WorkbookFactory and friends all require poi-ooxml, not just core poi|
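The rows above can also be read as a simple lookup from component to Maven artifactIds. The sketch below encodes only the rows shown in this table. The class PoiArtifactLookup is a hypothetical name for illustration, not something POI ships. For OpenXML4J it records the smaller poi-ooxml-schemas option; the table allows ooxml-schemas plus ooxml-security as an alternative.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical lookup built from the component table; not a POI class. */
class PoiArtifactLookup {
    private static final Map<String, List<String>> BY_COMPONENT = new LinkedHashMap<>();

    static {
        BY_COMPONENT.put("POIFS", Collections.singletonList("poi"));
        BY_COMPONENT.put("HPSF", Collections.singletonList("poi"));
        BY_COMPONENT.put("HSSF", Collections.singletonList("poi"));
        BY_COMPONENT.put("DDF", Collections.singletonList("poi"));
        // Alternative per the table: ooxml-schemas and ooxml-security instead of poi-ooxml-schemas.
        BY_COMPONENT.put("OpenXML4J", Arrays.asList("poi-ooxml", "poi-ooxml-schemas"));
        // Common SL needs both implementation jars.
        BY_COMPONENT.put("Common SL", Arrays.asList("poi-scratchpad", "poi-ooxml"));
        // WorkbookFactory and friends need poi-ooxml, not just the core jar.
        BY_COMPONENT.put("Common SS", Collections.singletonList("poi-ooxml"));
    }

    /** Returns the artifactIds listed for a component, or fails for components not in the table. */
    static List<String> artifactsFor(String component) {
        List<String> ids = BY_COMPONENT.get(component);
        if (ids == null) {
            throw new IllegalArgumentException("Component not in the table: " + component);
        }
        return ids;
    }

    public static void main(String[] args) {
        for (String component : BY_COMPONENT.keySet()) {
            System.out.println(component + " -> " + artifactsFor(component));
        }
    }
}
```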
This table maps artifacts into the jar file name. "version-yyyymmdd" is the POI version stamp. You can see what the latest stamp is on the downloads page.
|Maven artifactId||Prerequisites||JAR|
|poi||commons-logging, commons-codec, commons-collections (since POI 3.15 beta 3), commons-math3 (since POI 4.0.0), log4j||poi-version-yyyymmdd.jar|
|poi-examples||poi, poi-scratchpad, poi-ooxml||poi-examples-version-yyyymmdd.jar|
For signing: bcpkix-jdk15on, bcprov-jdk15on, xmlsec, slf4j-api
Apache commons-math3 was added as a dependency in POI 4.0.0.
poi-ooxml requires poi-ooxml-schemas. This is a substantially smaller version of the ooxml-schemas jar (ooxml-schemas-1.3.jar for POI 3.14 or later, ooxml-schemas-1.1.jar for POI 3.7 up to POI 3.13, ooxml-schemas-1.0.jar for POI 3.5 and 3.6). The larger ooxml-schemas jar is normally only required for development. Similarly, the ooxml-security jar, which contains all of the classes relating to encryption and signing, is normally only required for development. A subset of its contents is in poi-ooxml-schemas. The jar is ooxml-security-1.1.jar for POI 3.14 onwards and ooxml-security-1.0.jar prior to that.
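The version ranges in the paragraph above can be written down as a small selection function. This is only a sketch. The class SchemaJars and its methods are hypothetical names, it takes the POI version as a (major, minor) pair, and it deliberately covers only the 3.x releases named in the text. The text does not say which release first shipped ooxml-security, so no lower bound is checked for that jar.

```java
/** Hypothetical helper (not a POI class) encoding the jar versions stated in the text. */
class SchemaJars {

    /** Full ooxml-schemas jar matching a POI 3.x release (3.5 or later). */
    static String ooxmlSchemasJar(int major, int minor) {
        if (major != 3 || minor < 5) {
            throw new IllegalArgumentException(
                "Only POI 3.5 and later 3.x releases are covered: " + major + "." + minor);
        }
        if (minor >= 14) return "ooxml-schemas-1.3.jar"; // POI 3.14 or later
        if (minor >= 7) return "ooxml-schemas-1.1.jar";  // POI 3.7 up to 3.13
        return "ooxml-schemas-1.0.jar";                  // POI 3.5 and 3.6
    }

    /** ooxml-security jar matching a POI 3.x release. */
    static String ooxmlSecurityJar(int major, int minor) {
        if (major != 3 || minor < 0) {
            throw new IllegalArgumentException(
                "Only POI 3.x releases are covered: " + major + "." + minor);
        }
        // Assumption: the first release to ship this jar is not stated, so none is enforced.
        return minor >= 14 ? "ooxml-security-1.1.jar" : "ooxml-security-1.0.jar";
    }

    public static void main(String[] args) {
        System.out.println("POI 3.13: " + ooxmlSchemasJar(3, 13) + ", " + ooxmlSecurityJar(3, 13));
        System.out.println("POI 3.14: " + ooxmlSchemasJar(3, 14) + ", " + ooxmlSecurityJar(3, 14));
    }
}
```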
The OOXML jars require a StAX implementation, but now that Apache POI requires Java 6, that dependency is provided by the JRE and no additional StAX jars are required. The OOXML jars used to require DOM4J, but the code has now been changed to use JAXP and no additional dom4j jars are required. See the FAQ if you have problems when using a non-Oracle JDK.
The ooxml schemas jars are compiled with Apache XMLBeans 2.3, and so can be used at runtime with XMLBeans 2.3 or any newer version. Wherever possible though, we recommend that you use XMLBeans 2.6.0 with Apache POI, and that is the version now shipped in the binary release packages. If you have issues with redefined classes with XMLBeans 2.6, ask on the developer mailing list for solutions.
The POI Browser is a very simple Swing GUI tool that displays the internal structure of a Microsoft Office file, especially the property set streams. Further information, and instructions on how to run it, can be found in the POI source code.
All of the examples are included in POI distributions as a poi-examples artifact.
Running POI on other JVM languages
POI can be used from most languages that run on the JVM. For code examples, see Running POI on other JVM languages.
Besides the "official" components outlined above there is some further software distributed with POI. This is called "contributed" software. It is not explicitly recommended or even maintained by the POI team, but it might still be useful to you.
by Andrew C. Oliver, Rainer Klute, David Fisher