How to query and extract PDF metadata and metrics in Java
The JPedal library can query and extract metadata from PDF files through the methods of its PdfUtilities class. See the PdfUtilities Javadoc for the full list of methods.
To get started, create an instance of the PdfUtilities class from either a file or a byte array.
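// Option 1: load from file
final PdfUtilities utilities = new PdfUtilities("inputFile.pdf");
// Option 2: load from byte array (use instead of option 1)
// Any byte[] holding a complete PDF works; here it is read with java.nio.file.Files and Paths
final byte[] pdfBytes = Files.readAllBytes(Paths.get("inputFile.pdf"));
final PdfUtilities utilities = new PdfUtilities(pdfBytes);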
If the file is encrypted, you must supply the password.
utilities.setPassword("password");
Next, open the file so that you can access its metadata. Close the file once you have finished reading it.
try {
    if (utilities.openPDFFile()) {
        // Add metadata query methods here
    }
} finally {
    // Always close the file, even if a query throws
    utilities.closePDFfile();
}
Once the file is open, call any of the methods below.
Get the page count
final int numPages = utilities.getPageCount();
Page numbers in PDF files start from 1, so the result of getPageCount() is also the page number of the last page in the document.
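Because pages are numbered from 1, a loop over every page runs from 1 to the page count inclusive. The sketch below is standalone and only illustrates that numbering: the PageNumbers class and its allPages helper are hypothetical names, not part of JPedal, and the page count is passed in as a plain int rather than read from a PdfUtilities instance.

```java
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

// Hypothetical helper, not part of JPedal: lists the valid page numbers of a document.
final class PageNumbers {

    // Returns 1..pageCount inclusive, matching the 1-based numbering of PDF pages.
    static List<Integer> allPages(final int pageCount) {
        if (pageCount < 0) {
            throw new IllegalArgumentException("pageCount must not be negative");
        }
        return IntStream.rangeClosed(1, pageCount).boxed().collect(Collectors.toList());
    }

    public static void main(final String[] args) {
        // In real code this value would come from utilities.getPageCount().
        final int numPages = 3;
        for (final int page : allPages(numPages)) {
            System.out.println("Processing page " + page + " of " + numPages);
        }
    }
}
```

The same loop bounds apply to the per-page methods later in this guide, which all take a 1-based page number.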
Get the page dimensions
final int page = 1;
final PdfUtilities.PageUnits units = PdfUtilities.PageUnits.Pixels;
final PdfUtilities.PageSizeType box = PdfUtilities.PageSizeType.CropBox;
final float[] pageDimensions = utilities.getPageDimensions(page, units, box);
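Page dimensions can be returned as centimeters, inches, or pixels.
You can query either the page’s crop box or its media box. The media box defines the full extent of the physical page, while the crop box defines the region to which the page contents are clipped when displayed or printed. If a page has no crop box, it defaults to the media box.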
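To show how the three units relate, here is a minimal standalone sketch that converts a length in PDF points (1/72 inch in default user space) to inches, centimeters, and pixels. It does not reproduce how JPedal computes its values: the PageUnitConverter class is a hypothetical name, and the resolution used for pixels is an assumption supplied by the caller.

```java
// Hypothetical helper, not part of JPedal: converts lengths given in PDF points.
final class PageUnitConverter {

    // In default PDF user space, one unit (a point) is 1/72 of an inch.
    static final double POINTS_PER_INCH = 72.0;
    static final double CM_PER_INCH = 2.54;

    static double pointsToInches(final double points) {
        return points / POINTS_PER_INCH;
    }

    static double pointsToCentimeters(final double points) {
        return pointsToInches(points) * CM_PER_INCH;
    }

    // Pixel size depends on the chosen resolution; dpi is an assumption made by the caller.
    static double pointsToPixels(final double points, final double dpi) {
        return pointsToInches(points) * dpi;
    }

    public static void main(final String[] args) {
        // A US Letter page is 612 x 792 points.
        final double widthPoints = 612.0;
        final double heightPoints = 792.0;
        System.out.println("Inches: " + pointsToInches(widthPoints)
                + " x " + pointsToInches(heightPoints));
        System.out.println("Centimeters: " + pointsToCentimeters(widthPoints)
                + " x " + pointsToCentimeters(heightPoints));
        System.out.println("Pixels at 96 dpi: " + pointsToPixels(widthPoints, 96.0)
                + " x " + pointsToPixels(heightPoints, 96.0));
    }
}
```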
Get the PDF version
final String version = utilities.getPDFVersion();
Get all the document properties
final Map<String, String> documentProperties = utilities.getDocumentPropertyStringValuesAsMap();
The returned map uses the property name as the key, and the entry value contains the property’s value.
Document properties are deprecated in PDF 2.0 in favour of metadata fields.
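Once you have the map, a common next step is to list the properties that are actually set. The following is a minimal sketch that works on any Map<String, String>. The DocumentPropertiesPrinter class is a hypothetical name, and the sample keys (Title, Author, Producer) are standard PDF document information entries used here as an assumption about what the map may contain.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, not part of JPedal: formats a property map for display.
final class DocumentPropertiesPrinter {

    // Formats each non-empty property as "name: value", sorted by name for stable output.
    // Assumes the map contains no null keys.
    static String format(final Map<String, String> properties) {
        final StringBuilder out = new StringBuilder();
        for (final Map.Entry<String, String> entry : new TreeMap<>(properties).entrySet()) {
            final String value = entry.getValue();
            if (value != null && !value.isEmpty()) {
                out.append(entry.getKey()).append(": ").append(value).append('\n');
            }
        }
        return out.toString();
    }

    public static void main(final String[] args) {
        // Sample data standing in for the map returned by the library.
        final Map<String, String> documentProperties = new HashMap<>();
        documentProperties.put("Title", "Annual Report");
        documentProperties.put("Author", "");
        documentProperties.put("Producer", "Example Producer");
        System.out.print(format(documentProperties));
    }
}
```

Empty values are skipped because a property name can be present without a value having been set.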
Get all the metadata fields
final String documentMetadata = utilities.getDocumentPropertyFieldsInXML();
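The metadata comes back as a single string, so you will usually want to parse it before reading individual fields. Below is a minimal sketch using the JDK's built-in XML parser. It assumes the returned string is well-formed XML, which is not confirmed here; the MetadataXmlReader class and the sample XML are hypothetical and only illustrate the approach.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical helper, not part of JPedal: extracts text fields from an XML string.
final class MetadataXmlReader {

    // Collects "elementName=text" for every element that directly contains non-blank text.
    static List<String> readFields(final String xml) {
        final List<String> fields = new ArrayList<>();
        if (xml == null || xml.trim().isEmpty()) {
            return fields; // nothing to parse
        }
        try {
            final DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            // Reject DOCTYPE declarations as a precaution when parsing XML from untrusted files.
            factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
            final Document document = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            collect(document.getDocumentElement(), fields);
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Metadata is not well-formed XML", e);
        }
        return fields;
    }

    // Walks the element tree depth-first, recording the direct text of each element.
    private static void collect(final Node node, final List<String> fields) {
        final NodeList children = node.getChildNodes();
        final StringBuilder text = new StringBuilder();
        for (int i = 0; i < children.getLength(); i++) {
            final Node child = children.item(i);
            if (child.getNodeType() == Node.TEXT_NODE) {
                text.append(child.getNodeValue());
            } else if (child.getNodeType() == Node.ELEMENT_NODE) {
                collect(child, fields);
            }
        }
        final String value = text.toString().trim();
        if (!value.isEmpty()) {
            fields.add(node.getNodeName() + "=" + value);
        }
    }

    public static void main(final String[] args) {
        // Hypothetical sample; real metadata will have a different structure.
        final String sample = "<meta xmlns:dc=\"http://purl.org/dc/elements/1.1/\">"
                + "<dc:title>Annual Report</dc:title><dc:creator>Jane Doe</dc:creator></meta>";
        readFields(sample).forEach(System.out::println);
    }
}
```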
Check if the file contains any embedded fonts
final boolean hasEmbeddedFonts = utilities.hasEmbeddedFonts();
Get all the font data
final Map<Integer, String> documentFontData = utilities.getAllFontDataForDocument();
The returned map uses the page number as the key and the entry value contains details about the fonts for that page.
Get the font data for a page
final int page = 1;
final String fontDataForPage = utilities.getFontDataForPage(page);
Determine if the file contains marked content
final boolean containsMarkedContent = utilities.isMarkedContent();
Get the PDF permissions
final int permissions = utilities.getPdfFilePermissions();
final String permissionsText = PdfUtilities.showPermissionsAsString(permissions);
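If you need to test a single permission in code rather than read the whole set as text, you can check individual bits. The sketch below assumes the int holds the raw permission flags as defined in the PDF specification's user access permissions table, which is not confirmed for this method; the PdfPermissionFlags class and its constant names are hypothetical.

```java
// Hypothetical helper, not part of JPedal: decodes PDF permission flag bits.
final class PdfPermissionFlags {

    // Bit positions follow the PDF specification, where bit 1 is the lowest-order bit.
    static final int PRINT = 1 << 2;                  // bit 3
    static final int MODIFY = 1 << 3;                 // bit 4
    static final int COPY = 1 << 4;                   // bit 5
    static final int ANNOTATE = 1 << 5;               // bit 6
    static final int FILL_FORMS = 1 << 8;             // bit 9
    static final int EXTRACT_ACCESSIBILITY = 1 << 9;  // bit 10
    static final int ASSEMBLE = 1 << 10;              // bit 11
    static final int PRINT_HIGH_QUALITY = 1 << 11;    // bit 12

    // Returns true when the given flag bit is set in the permissions value.
    static boolean isAllowed(final int permissions, final int flag) {
        return (permissions & flag) != 0;
    }

    public static void main(final String[] args) {
        // Sample value; in real code this would come from utilities.getPdfFilePermissions().
        final int permissions = -44;
        System.out.println("Print allowed: " + isAllowed(permissions, PRINT));
        System.out.println("Modify allowed: " + isAllowed(permissions, MODIFY));
        System.out.println("Copy allowed: " + isAllowed(permissions, COPY));
    }
}
```

Permission values are usually negative because the specification requires the unused high-order bits to be set to 1.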
Get the image data for a page
final int page = 1;
final String imageDataForPage = utilities.getXImageDataForPage(page);
Get the number of commands for a page
final int page = 1;
final int commandsForPage = utilities.getCommandCountForPageStream(page);
Learn more
Check out our GitHub profile for a full example project.