POI-HSLF - A Quick Guide
Basic Text Extraction
For basic text extraction, make use of org.apache.poi.hslf.extractor.PowerPointExtractor. It accepts a file or an input stream. The getText() method can be used to get the text from the slides, and the getNotes() method can be used to get the text from the notes. Finally, getText(true,true) will get the text from both.
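The three calls above can be sketched as follows. This is a minimal example, assuming Apache POI is on the classpath and that the POI release in use still ships PowerPointExtractor (newer releases may no longer include it). The class name BasicTextExtraction, the helper extractAll and the path slideshow.ppt are illustrative placeholders, not part of POI.

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

import org.apache.poi.hslf.extractor.PowerPointExtractor;

public class BasicTextExtraction {

    /** Returns { slide text, notes text, slide text plus notes text }. */
    static String[] extractAll(InputStream in) throws IOException {
        try (PowerPointExtractor extractor = new PowerPointExtractor(in)) {
            return new String[] {
                extractor.getText(),           // text from the slides
                extractor.getNotes(),          // text from the notes
                extractor.getText(true, true)  // text from both
            };
        }
    }

    public static void main(String[] args) throws IOException {
        // Placeholder path; pass a real .ppt file as the first argument.
        String path = args.length > 0 ? args[0] : "slideshow.ppt";
        try (InputStream in = new FileInputStream(path)) {
            String[] text = extractAll(in);
            System.out.println("Slides:\n" + text[0]);
            System.out.println("Notes:\n" + text[1]);
            System.out.println("Both:\n" + text[2]);
        }
    }
}
```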
Specific Text Extraction
To get specific bits of text, first create an org.apache.poi.hslf.usermodel.HSLFSlideShow (from an org.apache.poi.hslf.usermodel.HSLFSlideShowImpl, which accepts a file or an input stream). Use getSlides() and getNotes() to get the slides and notes. These can be queried to get their page ID (though they should already be returned in the right order).
You can then call getTextParagraphs() on these to get their blocks of text. (A list of HSLFTextParagraph normally holds all the text in a given area of the page, e.g. in the title bar or in a box.) From the HSLFTextParagraph, you can extract the text and check what type of text it is (e.g. Body, Title). You can also call getTextRuns(), which returns the HSLFTextRuns that make up the HSLFTextParagraph. An HSLFTextRun is a text fragment whose characters all share the same character formatting. The paragraph formatting is defined in the parent HSLFTextParagraph.
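The walk from slide show to sheets, text blocks, paragraphs and runs might look like the sketch below. It assumes Apache POI (3.15 or later naming) is on the classpath. The class name SpecificTextExtraction, the describe helpers and the path slideshow.ppt are illustrative placeholders. The run type is printed as the raw integer returned by getRunType().

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;

import org.apache.poi.hslf.usermodel.HSLFNotes;
import org.apache.poi.hslf.usermodel.HSLFSheet;
import org.apache.poi.hslf.usermodel.HSLFSlide;
import org.apache.poi.hslf.usermodel.HSLFSlideShow;
import org.apache.poi.hslf.usermodel.HSLFSlideShowImpl;
import org.apache.poi.hslf.usermodel.HSLFTextParagraph;
import org.apache.poi.hslf.usermodel.HSLFTextRun;

public class SpecificTextExtraction {

    /** Adds one line per text block, paragraph and run found on the sheet. */
    static void describe(String label, HSLFSheet sheet, List<String> out) {
        // Each inner list is one block of text (a title, a text box, ...).
        for (List<HSLFTextParagraph> block : sheet.getTextParagraphs()) {
            out.add(label + " block: " + HSLFTextParagraph.getText(block));
            for (HSLFTextParagraph paragraph : block) {
                out.add("  paragraph, run type " + paragraph.getRunType());
                // Each run is a fragment with uniform character formatting.
                for (HSLFTextRun run : paragraph.getTextRuns()) {
                    out.add("    run: " + run.getRawText());
                }
            }
        }
    }

    /** Describes every slide and every notes sheet in the slide show. */
    static List<String> describe(HSLFSlideShow ppt) {
        List<String> out = new ArrayList<>();
        for (HSLFSlide slide : ppt.getSlides()) {
            describe("slide " + slide.getSlideNumber(), slide, out);
        }
        for (HSLFNotes notes : ppt.getNotes()) {
            describe("notes", notes, out);
        }
        return out;
    }

    public static void main(String[] args) throws IOException {
        // Placeholder path; pass a real .ppt file as the first argument.
        String path = args.length > 0 ? args[0] : "slideshow.ppt";
        try (InputStream in = new FileInputStream(path);
             HSLFSlideShow ppt = new HSLFSlideShow(new HSLFSlideShowImpl(in))) {
            for (String line : describe(ppt)) {
                System.out.println(line);
            }
        }
    }
}
```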
Poor Quality Text Extraction
If speed is the most important thing for you, you don't care about getting duplicate blocks of text, you don't care about getting text from master sheets, and you don't care about getting old text, then org.apache.poi.hslf.extractor.QuickButCruddyTextExtractor might be of use.
QuickButCruddyTextExtractor doesn't use the normal record parsing code. Instead, it uses a blind search over the record tree to find all text-holding records. You will get all the text, including lots of text you normally wouldn't ever want. However, you will get it back very fast!
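To make the idea of a blind search concrete, here is a simplified, standard-library-only sketch. It is not POI's actual code. It assumes the input is the raw record stream of a presentation (not the surrounding OLE2 container), that every record starts with an 8-byte little-endian header (2 bytes version and instance, 2 bytes type, 4 bytes length), and that only the two text atom types 4000 (TextCharsAtom, UTF-16LE) and 4008 (TextBytesAtom, one byte per character) matter. The class BlindTextScan and its helpers are hypothetical names.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class BlindTextScan {
    static final int TEXT_CHARS_ATOM = 4000; // payload is UTF-16LE text
    static final int TEXT_BYTES_ATOM = 4008; // payload is one byte per character

    /** Walks the headers one after another and collects every text payload. */
    static List<String> scan(byte[] stream) {
        List<String> found = new ArrayList<>();
        ByteBuffer buf = ByteBuffer.wrap(stream).order(ByteOrder.LITTLE_ENDIAN);
        int pos = 0;
        while (pos + 8 <= stream.length) {
            int verAndInstance = buf.getShort(pos) & 0xFFFF;
            int type = buf.getShort(pos + 2) & 0xFFFF;
            long length = buf.getInt(pos + 4) & 0xFFFFFFFFL;
            if ((verAndInstance & 0x0F) == 0x0F) {
                pos += 8; // container: step inside and keep scanning its children
                continue;
            }
            if (length > stream.length - pos - 8) {
                break; // truncated or corrupt record, stop scanning
            }
            int n = (int) length;
            if (type == TEXT_CHARS_ATOM) {
                found.add(new String(stream, pos + 8, n, StandardCharsets.UTF_16LE));
            } else if (type == TEXT_BYTES_ATOM) {
                found.add(new String(stream, pos + 8, n, StandardCharsets.ISO_8859_1));
            }
            pos += 8 + n; // skip over this atom, whatever it was
        }
        return found;
    }

    /** Builds one record: 8-byte header followed by the payload. */
    static byte[] record(int verAndInstance, int type, byte[] payload) {
        ByteBuffer buf = ByteBuffer.allocate(8 + payload.length).order(ByteOrder.LITTLE_ENDIAN);
        buf.putShort((short) verAndInstance).putShort((short) type).putInt(payload.length).put(payload);
        return buf.array();
    }

    /** A tiny hand-made stream: one container holding two text atoms. */
    static byte[] sample() {
        byte[] a = record(0, TEXT_BYTES_ATOM, "Hi".getBytes(StandardCharsets.ISO_8859_1));
        byte[] b = record(0, TEXT_CHARS_ATOM, "Ok".getBytes(StandardCharsets.UTF_16LE));
        byte[] children = new byte[a.length + b.length];
        System.arraycopy(a, 0, children, 0, a.length);
        System.arraycopy(b, 0, children, a.length, b.length);
        return record(0x0F, 1000, children); // 1000 is used here only as a container type
    }

    public static void main(String[] args) {
        System.out.println(scan(sample())); // prints [Hi, Ok]
    }
}
```

Because the scan never interprets what a record means, it is quick, but it also returns text from records that a structured parse would skip, which is the trade-off described above.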
There are two ways of getting the text back. getTextAsString() will return a single string with all the text in it. getTextAsVector() will return a vector of strings, one for each text record found in the file.
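Both retrieval methods can be exercised as in the sketch below, which assumes Apache POI is on the classpath. The class name QuickExtraction, the helper textRecords and the path slideshow.ppt are illustrative placeholders. The result of getTextAsVector() is copied into a plain List so the example does not depend on the concrete collection type returned.

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;

import org.apache.poi.hslf.extractor.QuickButCruddyTextExtractor;

public class QuickExtraction {

    /** Returns one string per text record found in the file. */
    static List<String> textRecords(InputStream in) throws IOException {
        QuickButCruddyTextExtractor extractor = new QuickButCruddyTextExtractor(in);
        try {
            return new ArrayList<>(extractor.getTextAsVector());
        } finally {
            extractor.close();
        }
    }

    public static void main(String[] args) throws IOException {
        // Placeholder path; pass a real .ppt file as the first argument.
        String path = args.length > 0 ? args[0] : "slideshow.ppt";
        try (InputStream in = new FileInputStream(path)) {
            QuickButCruddyTextExtractor extractor = new QuickButCruddyTextExtractor(in);
            try {
                // Everything in one string.
                System.out.println(extractor.getTextAsString());
                // One entry per text record.
                for (String record : extractor.getTextAsVector()) {
                    System.out.println("record: " + record);
                }
            } finally {
                extractor.close();
            }
        }
    }
}
```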
Changing Text
It is possible to change the text via HSLFTextParagraph.setText(List<HSLFTextParagraph>,String) or HSLFTextRun.setText(String). Additional text runs can be added with HSLFTextParagraph.appendText(List<HSLFTextParagraph>,String,boolean) or HSLFTextParagraph.addTextRun(HSLFTextRun).
When calling HSLFTextParagraph.setText(List<HSLFTextParagraph>,String), all the text will end up with the same formatting. When calling HSLFTextRun.setText(String), the text will retain the old formatting of that HSLFTextRun.
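The difference between the block-level and run-level calls can be sketched as below, assuming Apache POI is on the classpath. For brevity the example edits a free-standing HSLFTextBox rather than text loaded from a file. The class name ChangeText and the helper editSteps are illustrative placeholders.

```java
import java.util.List;

import org.apache.poi.hslf.usermodel.HSLFTextBox;
import org.apache.poi.hslf.usermodel.HSLFTextParagraph;
import org.apache.poi.hslf.usermodel.HSLFTextRun;

public class ChangeText {

    /** Returns the raw text of the block after each of the three edits. */
    static String[] editSteps() {
        HSLFTextBox box = new HSLFTextBox();
        box.setText("Original");
        List<HSLFTextParagraph> paragraphs = box.getTextParagraphs();

        // 1. Replace all text in the block; it ends up with one uniform formatting.
        HSLFTextParagraph.setText(paragraphs, "Replaced");
        String afterSetText = HSLFTextParagraph.getRawText(paragraphs);

        // 2. Append more text; 'false' means no new paragraph is started.
        HSLFTextParagraph.appendText(paragraphs, " and appended", false);
        String afterAppend = HSLFTextParagraph.getRawText(paragraphs);

        // 3. Change a single run; that run keeps its existing formatting.
        HSLFTextRun firstRun = paragraphs.get(0).getTextRuns().get(0);
        firstRun.setText("Changed");
        String afterRunSetText = HSLFTextParagraph.getRawText(paragraphs);

        return new String[] { afterSetText, afterAppend, afterRunSetText };
    }

    public static void main(String[] args) {
        for (String step : editSteps()) {
            System.out.println(step);
        }
    }
}
```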
Adding Slides
You may add new slides by calling HSLFSlideShow.createSlide(), which will add a new slide to the end of the SlideShow. It is possible to re-order slides with HSLFSlideShow.reorderSlide(...).
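A short sketch of both calls follows, assuming Apache POI is on the classpath and that reorderSlide takes the old and new 1-based slide numbers. The class name AddAndReorderSlides, the helper appendTwoAndMoveLastToFront and the output file reordered.ppt are illustrative placeholders.

```java
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.util.Arrays;
import java.util.List;

import org.apache.poi.hslf.usermodel.HSLFSlide;
import org.apache.poi.hslf.usermodel.HSLFSlideShow;

public class AddAndReorderSlides {

    /**
     * Appends two slides, then moves the last one to the front.
     * Returns the two new slides in the order they were created.
     */
    static List<HSLFSlide> appendTwoAndMoveLastToFront(HSLFSlideShow ppt) {
        HSLFSlide first = ppt.createSlide();  // added at the end of the show
        HSLFSlide second = ppt.createSlide(); // added after 'first'
        // Slide numbers are 1-based: move 'second' to position 1.
        ppt.reorderSlide(second.getSlideNumber(), 1);
        return Arrays.asList(first, second);
    }

    public static void main(String[] args) throws IOException {
        try (HSLFSlideShow ppt = new HSLFSlideShow()) {
            appendTwoAndMoveLastToFront(ppt);
            for (HSLFSlide slide : ppt.getSlides()) {
                System.out.println("slide number " + slide.getSlideNumber());
            }
            // Placeholder output path.
            try (OutputStream out = new FileOutputStream("reordered.ppt")) {
                ppt.write(out);
            }
        }
    }
}
```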
Guide to key classes