java.io.IOException: Expected string 'null' but missed at character 'u' at offset 6376

Stack Overflow | iCoder | 4 months ago
  1. Java - Issue with data extraction from PDF (PDFBox - 2.02)

  2. I am trying to extract text from PDFs. Extracting text from the test file http://digitalcorpora.org/corp/nps/files/govdocs1/020/020747.pdf causes exceptions to be thrown. The first:

    Exception in thread "main" java.lang.RuntimeException: java.io.IOException: Value is not an integer: 636121514401477526485946144
      at org.apache.pdfbox.pdfparser.PDFStreamParser$1.tryNext(PDFStreamParser.java:187)
      at org.apache.pdfbox.pdfparser.PDFStreamParser$1.hasNext(PDFStreamParser.java:194)
      at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:255)
      at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:235)
      at org.apache.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:215)
      at org.apache.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:455)
      at org.apache.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:379)
      at org.apache.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:335)
    Caused by: java.io.IOException: Value is not an integer: 636121514401477526485946144
      at org.apache.pdfbox.cos.COSNumber.get(COSNumber.java:104)
      at org.apache.pdfbox.pdfparser.PDFStreamParser.parseNextToken(PDFStreamParser.java:351)
      at org.apache.pdfbox.pdfparser.PDFStreamParser.access$000(PDFStreamParser.java:46)
      at org.apache.pdfbox.pdfparser.PDFStreamParser$1.tryNext(PDFStreamParser.java:182)

    Code to cause above exception:

      PDFTextStripper ts = new PDFTextStripper();
      PrintWriter out = new PrintWriter(new FileWriter(new File("020747.txt")));
      PDDocument doc = PDDocument.load(new File("020747.pdf").toURI().toURL(), true);
      ts.setForceParsing(true);
      ts.writeText(doc, out);

    Using the following code causes a different exception until org.apache.pdfbox.baseParser.pushBackSize is increased (only tested 1024768). After it is increased I get basically the same exception as above:

      PrintWriter out = new PrintWriter(new FileWriter(new File("020747.txt")));
      PDFParser parser = new PDFParser(new FileInputStream(new File("020747.pdf")));
      parser.parse();
      PDFTextStripper ts = new PDFTextStripper();
      ts.setForceParsing(true);
      ts.writeText(parser.getPDDocument(), out);

    Apache's JIRA Issue Tracker | 3 years ago | William Palmer
    java.lang.RuntimeException: java.io.IOException: Unknown dir object c=')' cInt=41 peek=')' peekInt=41 16574
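    A note on the "Value is not an integer" message in the JIRA report: the token 636121514401477526485946144 has 27 digits, while the largest signed 64-bit value (Long.MAX_VALUE, 9223372036854775807) has 19. The following is a minimal sketch in plain Java showing that such a token cannot be parsed into a long. It is not PDFBox's COSNumber code, whose internals are not shown on this page; the class name IntegerTokenCheck and the helper fitsInLong are hypothetical names chosen for illustration.

```java
import java.math.BigInteger;

// Hypothetical helper, not part of PDFBox: checks whether a decimal token
// from a content stream could be stored in a signed 64-bit integer.
final class IntegerTokenCheck {

    /** Returns true if the decimal token parses as a signed 64-bit long. */
    static boolean fitsInLong(String token) {
        try {
            Long.parseLong(token);
            return true;
        } catch (NumberFormatException e) {
            // Thrown for non-numeric text and for values outside the long range.
            return false;
        }
    }

    public static void main(String[] args) {
        // Value copied from the exception message in the report above.
        String token = "636121514401477526485946144";

        System.out.println("fits in long: " + fitsInLong(token));

        // BigInteger has no fixed width, so it can hold the value for comparison.
        BigInteger big = new BigInteger(token);
        boolean tooLarge = big.compareTo(BigInteger.valueOf(Long.MAX_VALUE)) > 0;
        System.out.println("exceeds Long.MAX_VALUE: " + tooLarge);
    }
}
```

    If the token really is an oversized number in a damaged content stream, no fixed-width integer type will accept it, so the parser has to either reject it or fall back to another representation. Which of these a given PDFBox version does is not something this sketch establishes.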

    Root Cause Analysis

    1. java.io.IOException

      Expected string 'null' but missed at character 'u' at offset 6376

      at org.apache.pdfbox.pdfparser.BaseParser.readExpectedString()
    2. Apache PDFBox
      PDFStreamParser.parseNextToken
      1. org.apache.pdfbox.pdfparser.BaseParser.readExpectedString(BaseParser.java:1017)
      2. org.apache.pdfbox.pdfparser.BaseParser.readExpectedString(BaseParser.java:1000)
      3. org.apache.pdfbox.pdfparser.BaseParser.parseDirObject(BaseParser.java:879)
      4. org.apache.pdfbox.pdfparser.BaseParser.parseCOSArray(BaseParser.java:651)
      5. org.apache.pdfbox.pdfparser.PDFStreamParser.parseNextToken(PDFStreamParser.java:175)
    3. org.apache.pdfbox
      PDFTextStripper.getText
      1. org.apache.pdfbox.contentstream.PDFStreamEngine.processStreamOperators(PDFStreamEngine.java:479)
      2. org.apache.pdfbox.contentstream.PDFStreamEngine.processStream(PDFStreamEngine.java:446)
      3. org.apache.pdfbox.contentstream.PDFStreamEngine.processPage(PDFStreamEngine.java:149)
      4. org.apache.pdfbox.text.PDFTextStreamEngine.processPage(PDFTextStreamEngine.java:136)
      5. org.apache.pdfbox.text.PDFTextStripper.processPage(PDFTextStripper.java:391)
      6. org.apache.pdfbox.text.PDFTextStripper.processPages(PDFTextStripper.java:319)
      7. org.apache.pdfbox.text.PDFTextStripper.writeText(PDFTextStripper.java:266)
      8. org.apache.pdfbox.text.PDFTextStripper.getText(PDFTextStripper.java:227)
    4. main
      Test.main
      1. main.Test.readPDF(Test.java:170)
      2. main.Test.main(Test.java:76)
    5. Java RT
      Method.invoke
      1. sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
      2. sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
      3. sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
      4. java.lang.reflect.Method.invoke(Method.java:498)
    6. IDEA
      AppMain.main
      1. com.intellij.rt.execution.application.AppMain.main(AppMain.java:147)
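
    The grouping above puts the java.io.IOException first as the root cause, and the older trace in the JIRA report shows the same pattern with an IOException wrapped in a RuntimeException. As a rough sketch of how calling code might recover the innermost exception before logging or searching for it, one can walk the cause chain with Throwable.getCause(). The class name RootCause and the method rootCause are hypothetical; the wrapped exception in main is constructed by hand to mirror the report, not produced by PDFBox.

```java
import java.io.IOException;

// Hypothetical utility for illustration: finds the innermost cause of an exception.
final class RootCause {

    /** Follows getCause() until it returns null and returns the last throwable seen. */
    static Throwable rootCause(Throwable t) {
        Throwable current = t;
        while (current.getCause() != null) {
            current = current.getCause();
        }
        return current;
    }

    public static void main(String[] args) {
        // Hand-built stand-in for the wrapped exception in the JIRA report.
        IOException inner =
                new IOException("Value is not an integer: 636121514401477526485946144");
        RuntimeException outer = new RuntimeException(inner);

        Throwable root = rootCause(outer);
        System.out.println(root.getClass().getName() + ": " + root.getMessage());
    }
}
```

    In a method like the readPDF frame shown in the trace, the same helper could be applied inside a catch block around the getText call, so that the message being searched for is the parser's IOException rather than a wrapper.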