com.atlassian.bonnie.search.extractor.ExtractorException: Error getting content of PDF document

Atlassian JIRA | Andrew Moise | 7 years ago
  1. 0

    My site's content index is only partially built, resulting in missing pages in search results. I see http://jira.atlassian.com/browse/CONF-18452 has been filed to fix the failure to completely index when there's a problem with a particular page, but I also wanted to file bugs about the underlying issues. This issue is a problem indexing a particular .pdf document:

    2010-02-22 11:10:26,019 WARN [Indexer: 2] [bonnie.search.extractor.BaseAttachmentContentExtractor] addFields Error indexing attachment (Attachment: PS3_Product_Guidelines_1.0_SCEE_English.pdf v.1 (5144583) kreiner)
     -- url: /confluence/admin/reindex.action | userName: moise | referer: https://qix.demiurgestudios.com/confluence/admin/search-indexes.action | action: reindex
    com.atlassian.bonnie.search.extractor.ExtractorException: Error getting content of PDF document
        at com.atlassian.bonnie.search.extractor.PdfContentExtractor.extractText(PdfContentExtractor.java:65)
        at com.atlassian.bonnie.search.extractor.BaseAttachmentContentExtractor.addFields(BaseAttachmentContentExtractor.java:39)
        at com.atlassian.confluence.plugin.descriptor.ExtractorModuleDescriptor$BackwardsCompatibleExtractor.addFields(ExtractorModuleDescriptor.java:43)
        at com.atlassian.bonnie.search.BaseDocumentBuilder.getDocument(BaseDocumentBuilder.java:104)
        at com.atlassian.confluence.search.lucene.ConfluenceDocumentBuilder.getDocument(ConfluenceDocumentBuilder.java:102)
        at com.atlassian.confluence.search.lucene.tasks.AddDocumentIndexTask.perform(AddDocumentIndexTask.java:41)
        at com.atlassian.bonnie.index.TempIndexWriter.perform(TempIndexWriter.java:72)
        at com.atlassian.confluence.search.lucene.TempIndexWriterStrategy.perform(TempIndexWriterStrategy.java:43)
        at com.atlassian.confluence.search.lucene.tasks.TempIndexBackedIndexTaskPerformer.perform(TempIndexBackedIndexTaskPerformer.java:21)
        at com.atlassian.confluence.search.lucene.DefaultObjectQueueWorker.indexCollection(DefaultObjectQueueWorker.java:73)
        at com.atlassian.confluence.search.lucene.DefaultObjectQueueWorker$1.doInTransactionWithoutResult(DefaultObjectQueueWorker.java:61)
        at org.springframework.transaction.support.TransactionCallbackWithoutResult.doInTransaction(TransactionCallbackWithoutResult.java:33)
        at org.springframework.transaction.support.TransactionTemplate.execute(TransactionTemplate.java:127)
        at com.atlassian.confluence.search.lucene.DefaultObjectQueueWorker.run(DefaultObjectQueueWorker.java:50)
        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:650)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:675)
        at java.lang.Thread.run(Thread.java:595)
    Caused by: java.io.IOException: Unknown font subtype=COSName{}
        at org.pdfbox.pdmodel.font.PDFontFactory.createFont(PDFontFactory.java:103)
        at org.pdfbox.pdmodel.PDResources.getFonts(PDResources.java:135)
        at org.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:178)
        at org.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:160)
        at org.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:355)
        at org.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:268)
        at org.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:220)
        at com.atlassian.bonnie.search.extractor.PdfContentExtractor.extractText(PdfContentExtractor.java:49)
        ... 16 more

    Atlassian JIRA | 7 years ago | Andrew Moise
    com.atlassian.bonnie.search.extractor.ExtractorException: Error getting content of PDF document
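
    The report above mentions that one unreadable attachment can leave the whole index incomplete. As a rough illustration of the general idea behind tolerating such failures, here is a minimal sketch in which a failure on one attachment is recorded and the run continues. Everything in it is hypothetical: `TolerantIndexer`, `TextExtractor`, `Result`, and `indexAll` are illustrative names, not Confluence, Bonnie, or PDFBox APIs, and this is not how CONF-18452 was actually resolved.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: none of these types are Confluence, Bonnie, or PDFBox APIs.
class TolerantIndexer {

    /** Stand-in for a per-attachment text extractor (hypothetical interface). */
    interface TextExtractor {
        String extractText(String attachmentName) throws IOException;
    }

    /** Outcome of one run: extracted text per attachment, plus the failures. */
    static final class Result {
        final Map<String, String> indexed = new LinkedHashMap<>();
        final List<String> failed = new ArrayList<>();
    }

    /** Indexes every attachment; a failure on one is recorded and the loop continues. */
    static Result indexAll(List<String> attachments, TextExtractor extractor) {
        Result result = new Result();
        for (String name : attachments) {
            try {
                result.indexed.put(name, extractor.extractText(name));
            } catch (IOException | RuntimeException e) {
                // Record the failure instead of aborting the whole run.
                result.failed.add(name + ": " + e.getMessage());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Simulated extractor: any name starting with "bad" fails like the PDF above.
        TextExtractor extractor = name -> {
            if (name.startsWith("bad")) {
                throw new IOException("Unknown font subtype=COSName{}");
            }
            return "text of " + name;
        };
        Result r = indexAll(Arrays.asList("a.pdf", "bad.pdf", "c.pdf"), extractor);
        System.out.println("indexed: " + r.indexed.keySet());
        System.out.println("failed:  " + r.failed);
    }
}
```

    Catching `RuntimeException` alongside `IOException` is a deliberate choice in this sketch: the traces on this page include unchecked failures such as `ClassCastException` and `NullPointerException` from the PDF parser, and those would otherwise still abort the loop.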
  4. 0

    We have found that updating the pdfbox library to the latest stable version (1.2.1) solves all our current issues with PDF text extraction and improves performance. This could help people who want to rely on the DSpace "out-of-box" PDF extractor without using XPDF. Below are some samples of exceptions that go away after updating the pdfbox version. Patch attached against trunk r5439.

    ===
    java.io.IOException: Error: Could not find font(COSName{F1.0}) in map={}
        at org.pdfbox.util.operator.SetTextFont.process(SetTextFont.java:83)
        at org.pdfbox.util.PDFStreamEngine.processOperator(PDFStreamEngine.java:452)
        at org.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:215)
        at org.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:174)
        at org.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:336)
        at org.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:259)
        at org.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:216)
        at org.dspace.app.mediafilter.PDFFilter.getDestinationStream(PDFFilter.java:139)
    ===
    java.lang.ClassCastException: org.pdfbox.cos.COSArray cannot be cast to org.pdfbox.cos.COSDictionary
        at org.pdfbox.filter.FlateFilter.decode(FlateFilter.java:70)
        at org.pdfbox.cos.COSStream.doDecode(COSStream.java:290)
        at org.pdfbox.cos.COSStream.doDecode(COSStream.java:243)
        at org.pdfbox.cos.COSStream.getUnfilteredStream(COSStream.java:170)
        at org.pdfbox.pdfparser.PDFStreamParser.<init>(PDFStreamParser.java:101)
        at org.pdfbox.cos.COSStream.getStreamTokens(COSStream.java:132)
        at org.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:202)
        at org.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:174)
        at org.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:336)
        at org.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:259)
        at org.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:216)
        at org.dspace.app.mediafilter.PDFFilter.getDestinationStream(PDFFilter.java:139)
    ===
    java.io.IOException: Unknown colorspace array type:COSName{DeviceRGB}
        at org.pdfbox.pdmodel.graphics.color.PDColorSpaceFactory.createColorSpace(PDColorSpaceFactory.java:116)
        at org.pdfbox.pdmodel.PDResources.getColorSpaces(PDResources.java:264)
        at org.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:193)
        at org.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:174)
        at org.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:336)
        at org.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:259)
        at org.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:216)
        at org.dspace.app.mediafilter.PDFFilter.getDestinationStream(PDFFilter.java:139)
    ===
    java.lang.NullPointerException
        at org.pdfbox.pdmodel.PDPageNode.getAllKids(PDPageNode.java:194)
        at org.pdfbox.pdmodel.PDPageNode.getAllKids(PDPageNode.java:182)
        at org.pdfbox.pdmodel.PDDocumentCatalog.getAllPages(PDDocumentCatalog.java:226)
        at org.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:216)
        at org.dspace.app.mediafilter.PDFFilter.getDestinationStream(PDFFilter.java:139)
    ===
    java.util.zip.ZipException: unknown compression method
        at java.util.zip.InflaterInputStream.read(Unknown Source)
        at org.pdfbox.filter.FlateFilter.decode(FlateFilter.java:97)
        at org.pdfbox.cos.COSStream.doDecode(COSStream.java:290)
        at org.pdfbox.cos.COSStream.doDecode(COSStream.java:235)
        at org.pdfbox.cos.COSStream.getUnfilteredStream(COSStream.java:170)
        at org.pdfbox.pdfparser.PDFObjectStreamParser.<init>(PDFObjectStreamParser.java:66)
        at org.pdfbox.cos.COSDocument.dereferenceObjectStreams(COSDocument.java:450)
        at org.pdfbox.pdmodel.PDDocument.openProtection(PDDocument.java:908)
        at org.pdfbox.pdmodel.PDDocument.decrypt(PDDocument.java:489)
        at org.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:204)
        at org.dspace.app.mediafilter.PDFFilter.getDestinationStream(PDFFilter.java:139)
    ===
    java.lang.ArrayIndexOutOfBoundsException
        at java.lang.System.arraycopy(Native Method)
        at java.io.PushbackInputStream.unread(Unknown Source)
        at org.pdfbox.pdfparser.BaseParser.parseCOSString(BaseParser.java:524)
        at org.pdfbox.pdfparser.BaseParser.parseDirObject(BaseParser.java:873)
        at org.pdfbox.pdfparser.PDFObjectStreamParser.parse(PDFObjectStreamParser.java:94)
        at org.pdfbox.cos.COSDocument.dereferenceObjectStreams(COSDocument.java:451)
        at org.pdfbox.pdmodel.PDDocument.openProtection(PDDocument.java:908)
        at org.pdfbox.pdmodel.PDDocument.decrypt(PDDocument.java:489)
        at org.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:204)
        at org.dspace.app.mediafilter.PDFFilter.getDestinationStream(PDFFilter.java:139)
    ===
    java.io.EOFException: Unexpected end of ZLIB input stream
        at java.util.zip.InflaterInputStream.fill(Unknown Source)
        at java.util.zip.InflaterInputStream.read(Unknown Source)
        at org.pdfbox.filter.FlateFilter.decode(FlateFilter.java:97)
        at org.pdfbox.cos.COSStream.doDecode(COSStream.java:290)
        at org.pdfbox.cos.COSStream.doDecode(COSStream.java:235)
        at org.pdfbox.cos.COSStream.getUnfilteredStream(COSStream.java:170)
        at org.pdfbox.pdfparser.PDFStreamParser.<init>(PDFStreamParser.java:101)
        at org.pdfbox.cos.COSStream.getStreamTokens(COSStream.java:132)
        at org.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:202)
        at org.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:174)
        at org.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:336)
        at org.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:259)
        at org.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:216)
        at org.dspace.app.mediafilter.PDFFilter.getDestinationStream(PDFFilter.java:139)

    Sakai JIRA | 6 years ago | Andrea Bollini
    java.io.IOException: Unknown colorspace array type:COSName{DeviceRGB}
  5. 0

    PDFBox filling form exception

    Coderanch | 3 years ago | giuseppe gio
    java.io.IOException: Cannot create font as /SubType is not set.

    Root Cause Analysis

    1. java.io.IOException

      Unknown font subtype=COSName{}

      at org.pdfbox.pdmodel.font.PDFontFactory.createFont()
    2. PDFBox - Java PDF Library
      PDFTextStripper.writeText
      1. org.pdfbox.pdmodel.font.PDFontFactory.createFont(PDFontFactory.java:103)
      2. org.pdfbox.pdmodel.PDResources.getFonts(PDResources.java:135)
      3. org.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:178)
      4. org.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:160)
      5. org.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:355)
      6. org.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:268)
      7. org.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:220)
      7 frames
    3. com.atlassian.bonnie
      BaseAttachmentContentExtractor.addFields
      1. com.atlassian.bonnie.search.extractor.PdfContentExtractor.extractText(PdfContentExtractor.java:49)
      2. com.atlassian.bonnie.search.extractor.BaseAttachmentContentExtractor.addFields(BaseAttachmentContentExtractor.java:39)
      2 frames
    4. com.atlassian.confluence
      ExtractorModuleDescriptor$BackwardsCompatibleExtractor.addFields
      1. com.atlassian.confluence.plugin.descriptor.ExtractorModuleDescriptor$BackwardsCompatibleExtractor.addFields(ExtractorModuleDescriptor.java:43)
      1 frame
    5. com.atlassian.bonnie
      BaseDocumentBuilder.getDocument
      1. com.atlassian.bonnie.search.BaseDocumentBuilder.getDocument(BaseDocumentBuilder.java:104)
      1 frame
    6. com.atlassian.confluence
      AddDocumentIndexTask.perform
      1. com.atlassian.confluence.search.lucene.ConfluenceDocumentBuilder.getDocument(ConfluenceDocumentBuilder.java:102)
      2. com.atlassian.confluence.search.lucene.tasks.AddDocumentIndexTask.perform(AddDocumentIndexTask.java:41)
      2 frames
    7. com.atlassian.bonnie
      TempIndexWriter.perform
      1. com.atlassian.bonnie.index.TempIndexWriter.perform(TempIndexWriter.java:72)
      1 frame
    8. com.atlassian.confluence
      DefaultObjectQueueWorker$1.doInTransactionWithoutResult
      1. com.atlassian.confluence.search.lucene.TempIndexWriterStrategy.perform(TempIndexWriterStrategy.java:43)
      2. com.atlassian.confluence.search.lucene.tasks.TempIndexBackedIndexTaskPerformer.perform(TempIndexBackedIndexTaskPerformer.java:21)
      3. com.atlassian.confluence.search.lucene.DefaultObjectQueueWorker.indexCollection(DefaultObjectQueueWorker.java:73)
      4. com.atlassian.confluence.search.lucene.DefaultObjectQueueWorker$1.doInTransactionWithoutResult(DefaultObjectQueueWorker.java:61)
      4 frames
    9. Spring Tx
      TransactionTemplate.execute
      1. org.springframework.transaction.support.TransactionCallbackWithoutResult.doInTransaction(TransactionCallbackWithoutResult.java:33)
      2. org.springframework.transaction.support.TransactionTemplate.execute(TransactionTemplate.java:127)
      2 frames
    10. com.atlassian.confluence
      DefaultObjectQueueWorker.run
      1. com.atlassian.confluence.search.lucene.DefaultObjectQueueWorker.run(DefaultObjectQueueWorker.java:50)
      1 frame
    11. Java RT
      Thread.run
      1. java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:650)
      2. java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:675)
      3. java.lang.Thread.run(Thread.java:595)
      3 frames
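
    The breakdown above starts from the innermost exception (`java.io.IOException`) and not from the `ExtractorException` that was logged. For readers who want to do the same in their own logging code, a minimal sketch of walking the `getCause()` chain follows. The class name `RootCauseFinder` and the method `rootCause` are illustrative, and a plain `RuntimeException` stands in for `ExtractorException`, which is not available outside the Atlassian libraries.

```java
import java.io.IOException;

// Illustrative helper; the names here are assumptions, not part of any library on this page.
class RootCauseFinder {

    /** Follows getCause() until the innermost throwable is reached. */
    static Throwable rootCause(Throwable t) {
        Throwable current = t;
        // Stop when there is no further cause, or a throwable names itself as its cause.
        while (current.getCause() != null && current.getCause() != current) {
            current = current.getCause();
        }
        return current;
    }

    public static void main(String[] args) {
        // Rebuild the shape of the chain from the trace: wrapper -> IOException.
        IOException inner = new IOException("Unknown font subtype=COSName{}");
        RuntimeException wrapper =
                new RuntimeException("Error getting content of PDF document", inner);

        Throwable root = rootCause(wrapper);
        System.out.println(root.getClass().getName() + ": " + root.getMessage());
    }
}
```

    Logging the root cause next to the wrapper message makes it easier to tell a font-parsing failure apart from, say, a truncated or corrupt stream, both of which surface under the same wrapper text.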