java.lang.NullPointerException

Sakai JIRA | Andrea Bollini | 6 years ago
  1. 0

    We have found that updating the pdfbox library to the latest stable version (1.2.1) solves all our current issues with PDF text extraction and improves performance. This could help people who want to rely on the DSpace "out-of-box" PDF extractor without using XPDF. Below are some sample exceptions that go away after updating the pdfbox version. Patch attached against trunk r5439.

    ===
    java.io.IOException: Error: Could not find font(COSName{F1.0}) in map={}
        at org.pdfbox.util.operator.SetTextFont.process(SetTextFont.java:83)
        at org.pdfbox.util.PDFStreamEngine.processOperator(PDFStreamEngine.java:452)
        at org.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:215)
        at org.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:174)
        at org.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:336)
        at org.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:259)
        at org.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:216)
        at org.dspace.app.mediafilter.PDFFilter.getDestinationStream(PDFFilter.java:139)

    ===
    java.lang.ClassCastException: org.pdfbox.cos.COSArray cannot be cast to org.pdfbox.cos.COSDictionary
        at org.pdfbox.filter.FlateFilter.decode(FlateFilter.java:70)
        at org.pdfbox.cos.COSStream.doDecode(COSStream.java:290)
        at org.pdfbox.cos.COSStream.doDecode(COSStream.java:243)
        at org.pdfbox.cos.COSStream.getUnfilteredStream(COSStream.java:170)
        at org.pdfbox.pdfparser.PDFStreamParser.<init>(PDFStreamParser.java:101)
        at org.pdfbox.cos.COSStream.getStreamTokens(COSStream.java:132)
        at org.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:202)
        at org.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:174)
        at org.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:336)
        at org.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:259)
        at org.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:216)
        at org.dspace.app.mediafilter.PDFFilter.getDestinationStream(PDFFilter.java:139)

    ===
    java.io.IOException: Unknown colorspace array type:COSName{DeviceRGB}
        at org.pdfbox.pdmodel.graphics.color.PDColorSpaceFactory.createColorSpace(PDColorSpaceFactory.java:116)
        at org.pdfbox.pdmodel.PDResources.getColorSpaces(PDResources.java:264)
        at org.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:193)
        at org.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:174)
        at org.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:336)
        at org.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:259)
        at org.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:216)
        at org.dspace.app.mediafilter.PDFFilter.getDestinationStream(PDFFilter.java:139)

    ===
    java.lang.NullPointerException
        at org.pdfbox.pdmodel.PDPageNode.getAllKids(PDPageNode.java:194)
        at org.pdfbox.pdmodel.PDPageNode.getAllKids(PDPageNode.java:182)
        at org.pdfbox.pdmodel.PDDocumentCatalog.getAllPages(PDDocumentCatalog.java:226)
        at org.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:216)
        at org.dspace.app.mediafilter.PDFFilter.getDestinationStream(PDFFilter.java:139)

    ===
    java.util.zip.ZipException: unknown compression method
        at java.util.zip.InflaterInputStream.read(Unknown Source)
        at org.pdfbox.filter.FlateFilter.decode(FlateFilter.java:97)
        at org.pdfbox.cos.COSStream.doDecode(COSStream.java:290)
        at org.pdfbox.cos.COSStream.doDecode(COSStream.java:235)
        at org.pdfbox.cos.COSStream.getUnfilteredStream(COSStream.java:170)
        at org.pdfbox.pdfparser.PDFObjectStreamParser.<init>(PDFObjectStreamParser.java:66)
        at org.pdfbox.cos.COSDocument.dereferenceObjectStreams(COSDocument.java:450)
        at org.pdfbox.pdmodel.PDDocument.openProtection(PDDocument.java:908)
        at org.pdfbox.pdmodel.PDDocument.decrypt(PDDocument.java:489)
        at org.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:204)
        at org.dspace.app.mediafilter.PDFFilter.getDestinationStream(PDFFilter.java:139)

    ===
    java.lang.ArrayIndexOutOfBoundsException
        at java.lang.System.arraycopy(Native Method)
        at java.io.PushbackInputStream.unread(Unknown Source)
        at org.pdfbox.pdfparser.BaseParser.parseCOSString(BaseParser.java:524)
        at org.pdfbox.pdfparser.BaseParser.parseDirObject(BaseParser.java:873)
        at org.pdfbox.pdfparser.PDFObjectStreamParser.parse(PDFObjectStreamParser.java:94)
        at org.pdfbox.cos.COSDocument.dereferenceObjectStreams(COSDocument.java:451)
        at org.pdfbox.pdmodel.PDDocument.openProtection(PDDocument.java:908)
        at org.pdfbox.pdmodel.PDDocument.decrypt(PDDocument.java:489)
        at org.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:204)
        at org.dspace.app.mediafilter.PDFFilter.getDestinationStream(PDFFilter.java:139)

    ===
    java.io.EOFException: Unexpected end of ZLIB input stream
        at java.util.zip.InflaterInputStream.fill(Unknown Source)
        at java.util.zip.InflaterInputStream.read(Unknown Source)
        at org.pdfbox.filter.FlateFilter.decode(FlateFilter.java:97)
        at org.pdfbox.cos.COSStream.doDecode(COSStream.java:290)
        at org.pdfbox.cos.COSStream.doDecode(COSStream.java:235)
        at org.pdfbox.cos.COSStream.getUnfilteredStream(COSStream.java:170)
        at org.pdfbox.pdfparser.PDFStreamParser.<init>(PDFStreamParser.java:101)
        at org.pdfbox.cos.COSStream.getStreamTokens(COSStream.java:132)
        at org.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:202)
        at org.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:174)
        at org.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:336)
        at org.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:259)
        at org.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:216)
        at org.dspace.app.mediafilter.PDFFilter.getDestinationStream(PDFFilter.java:139)


  3. 0

    I've installed Terrier 3.5 on Windows XP and started desktop_terrier. After that, I chose a directory to index and started indexing. After about 50 documents Terrier threw an exception because it was not able to index one particular PDF document (some other PDFs worked). Is there any way to tell Terrier to skip such exceptions and carry on with indexing? Here is the exception/log:

    Set TERRIER_HOME to be D:\Java\terrier
    WARNING: The file terrier.properties was not found at location D:\Java\terrier\etc\terrier.properties
    Assuming the value of terrier.home from the corresponding system property.
    INFO - Deleting: D:\Java\terrier\var\index\data_1.direct.bf: true
    INFO - Deleting: D:\Java\terrier\var\index\data_1.document.fsarrayfile: true
    INFO - Deleting: D:\Java\terrier\var\index\data_1.meta.idx: true
    INFO - Deleting: D:\Java\terrier\var\index\data_1.meta.zdata: true
    INFO - creating the data structures data_1
    INFO - BlockIndexer creating direct index
    INFO - NEXT: D:\Virtual Machines\host\Privat\_dokumente .....
    java.lang.NullPointerException
        at org.pdfbox.util.PDFStreamEngine.showString(PDFStreamEngine.java:254)
        at org.pdfbox.util.PDFStreamEngine.processOperator(PDFStreamEngine.java:773)
        at org.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:139)
        at org.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:211)
        at org.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:185)
        at org.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:161)
        at org.terrier.indexing.PDFDocument.getReader(PDFDocument.java:111)
        at org.terrier.indexing.FileDocument.<init>(FileDocument.java:130)
        at org.terrier.indexing.PDFDocument.<init>(PDFDocument.java:68)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance(Unknown Source)
        at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(Unknown Source)
        at java.lang.reflect.Constructor.newInstance(Unknown Source)
        at org.terrier.indexing.SimpleFileCollection.makeDocument(SimpleFileCollection.java:342)
        at org.terrier.indexing.SimpleFileCollection.getDocument(SimpleFileCollection.java:303)
        at org.terrier.indexing.BlockIndexer.createDirectIndex(BlockIndexer.java:357)
        at org.terrier.indexing.Indexer.index(Indexer.java:346)
        at org.terrier.applications.desktop.DesktopTerrier.runIndex(DesktopTerrier.java:1129)
        at org.terrier.applications.desktop.DesktopTerrier.access$1100(DesktopTerrier.java:114)
        at org.terrier.applications.desktop.DesktopTerrier$8$1.run(DesktopTerrier.java:498)
    ERROR - An unexpected exception occured while indexing. Indexing has been aborted.
    java.lang.NullPointerException
        at org.terrier.indexing.tokenisation.EnglishTokeniser$EnglishTokenStream.next(EnglishTokeniser.java:97)
        at org.terrier.indexing.tokenisation.EnglishTokeniser$EnglishTokenStream.next(EnglishTokeniser.java:76)
        at org.terrier.indexing.FileDocument.getNextTerm(FileDocument.java:221)
        at org.terrier.indexing.BlockIndexer.createDirectIndex(BlockIndexer.java:371)
        at org.terrier.indexing.Indexer.index(Indexer.java:346)
        at org.terrier.applications.desktop.DesktopTerrier.runIndex(DesktopTerrier.java:1129)
        at org.terrier.applications.desktop.DesktopTerrier.access$1100(DesktopTerrier.java:114)
        at org.terrier.applications.desktop.DesktopTerrier$8$1.run(DesktopTerrier.java:498)
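
One way to get the skip-and-continue behaviour asked about here is to wrap the per-document extraction call in a try/catch and record the failures instead of aborting the run. The following is a minimal sketch in plain Java, not Terrier's or PDFBox's API: `SkipOnFailure`, `Extractor`, and `Result` are hypothetical names introduced for illustration, and the extractor stands in for whatever call turns a PDF into text. It catches `Exception` rather than only `IOException` because the traces in this post are unchecked `NullPointerException`s. Whether Terrier 3.5 offers a setting for this is not stated in the post.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for illustration; not part of Terrier or PDFBox.
class SkipOnFailure {

    /** Stand-in for a text extractor that may fail on a malformed document. */
    interface Extractor {
        String extract(String documentName) throws Exception;
    }

    /** Extracted texts plus the names of the documents that were skipped. */
    static final class Result {
        final List<String> texts = new ArrayList<>();
        final List<String> skipped = new ArrayList<>();
    }

    static Result extractAll(List<String> documentNames, Extractor extractor) {
        Result result = new Result();
        for (String name : documentNames) {
            try {
                String text = extractor.extract(name);
                if (text == null) {
                    // Treat a missing result as a failure so later stages never see null.
                    result.skipped.add(name);
                } else {
                    result.texts.add(text);
                }
            } catch (Exception e) {
                // Checked and unchecked failures alike: note the document and move on.
                result.skipped.add(name);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Result r = extractAll(Arrays.asList("a.pdf", "bad.pdf", "c.pdf"), name -> {
            if (name.equals("bad.pdf")) {
                throw new NullPointerException("simulated extractor failure");
            }
            return "text of " + name;
        });
        System.out.println("indexed: " + r.texts.size() + ", skipped: " + r.skipped);
    }
}
```

Collecting the skipped names rather than only logging them makes it easy to report, at the end of a run, which documents need a closer look.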

    JIRA | 5 years ago | Ulrich Kaemmerer
    java.lang.NullPointerException
  4. 0

    [CONF-18962] Some pdf files don't get correctly indexed - Atlassian JIRA

    atlassian.com | 8 months ago
    com.atlassian.bonnie.search.extractor.ExtractorException: Error getting content of PDF document

    Root Cause Analysis

    1. java.lang.NullPointerException

      No message provided

      at org.pdfbox.pdmodel.PDPageNode.getAllKids()
    2. PDFBox - Java PDF Library
      PDFTextStripper.writeText
      1. org.pdfbox.pdmodel.PDPageNode.getAllKids(PDPageNode.java:194)
      2. org.pdfbox.pdmodel.PDPageNode.getAllKids(PDPageNode.java:182)
      3. org.pdfbox.pdmodel.PDDocumentCatalog.getAllPages(PDDocumentCatalog.java:226)
      4. org.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:216)
      4 frames
    3. DSpace Kernel :: API and Implementation
      PDFFilter.getDestinationStream
      1. org.dspace.app.mediafilter.PDFFilter.getDestinationStream(PDFFilter.java:139)
      1 frame