Expected string 'null' but missed at character 'u' at offset 6376

    Java - Issue with data extraction from PDF (PDFBox - 2.02)

    Stack Overflow | 6 months ago | iCoder
    I am trying to extract text from PDFs. Extracting text from the test file causes exceptions to be thrown. The first:

    Exception in thread "main" java.lang.RuntimeException: Value is not an integer: 636121514401477526485946144
      at org.apache.pdfbox.pdfparser.PDFStreamParser$1.tryNext(
      at org.apache.pdfbox.pdfparser.PDFStreamParser$1.hasNext(
      at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(
      at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(
      at org.apache.pdfbox.util.PDFStreamEngine.processStream(
      at org.apache.pdfbox.util.PDFTextStripper.processPage(
      at org.apache.pdfbox.util.PDFTextStripper.processPages(
      at org.apache.pdfbox.util.PDFTextStripper.writeText(
    Caused by: Value is not an integer: 636121514401477526485946144
      at org.apache.pdfbox.cos.COSNumber.get(
      at org.apache.pdfbox.pdfparser.PDFStreamParser.parseNextToken(
      at org.apache.pdfbox.pdfparser.PDFStreamParser.access$000(
      at org.apache.pdfbox.pdfparser.PDFStreamParser$1.tryNext(

    Code that causes the above exception:

      PDFTextStripper ts = new PDFTextStripper();
      PrintWriter out = new PrintWriter(new FileWriter(new File("020747.txt")));
      PDDocument doc = PDDocument.load(new File("020747.pdf").toURI().toURL(), true);
      ts.setForceParsing(true);
      ts.writeText(doc, out);

    Using the following code causes a different exception until org.apache.pdfbox.baseParser.pushBackSize is increased (only tested with 1024768). After it is increased, I get basically the same exception as above:

      PrintWriter out = new PrintWriter(new FileWriter(new File("020747.txt")));
      PDFParser parser = new PDFParser(new FileInputStream(new File("020747.pdf")));
      parser.parse();
      PDFTextStripper ts = new PDFTextStripper();
      ts.setForceParsing(true);
      ts.writeText(parser.getPDDocument(), out);
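
    The value in the "Value is not an integer" message has 27 digits, which is too large for a signed 64-bit long. The sketch below shows this with the standard library only. It assumes the parser converts integer tokens with something like Long.parseLong, which is an assumption about PDFBox internals and is not confirmed by the trace. The class name IntegerTokenCheck and the helper fitsInLong are hypothetical and are not part of PDFBox.

```java
import java.math.BigInteger;

class IntegerTokenCheck {
    // Hypothetical helper: true when the decimal token fits in a signed 64-bit long.
    static boolean fitsInLong(String token) {
        try {
            Long.parseLong(token);
            return true;
        } catch (NumberFormatException e) {
            // Thrown for non-numeric input and for values outside the long range.
            return false;
        }
    }

    public static void main(String[] args) {
        // Token copied from the exception message in the question.
        String token = "636121514401477526485946144";

        System.out.println("fits in long: " + fitsInLong(token));

        // BigInteger has no fixed width, so it can hold the value for comparison.
        BigInteger value = new BigInteger(token);
        BigInteger longMax = BigInteger.valueOf(Long.MAX_VALUE);
        System.out.println("exceeds Long.MAX_VALUE: " + (value.compareTo(longMax) > 0));
    }
}
```

    If that assumption holds, a number this large inside a content stream probably points to a malformed or misread stream rather than a real operand, so widening the integer type would likely only hide the underlying problem.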

    java.lang.RuntimeException: Unknown dir object c=')' cInt=41 peek=')' peekInt=41 16574

    Root Cause Analysis

      at org.apache.pdfbox.pdfparser.BaseParser.readExpectedString()
    2. Apache PDFBox
      1. org.apache.pdfbox.pdfparser.BaseParser.readExpectedString(
      2. org.apache.pdfbox.pdfparser.BaseParser.readExpectedString(
      3. org.apache.pdfbox.pdfparser.BaseParser.parseDirObject(
      4. org.apache.pdfbox.pdfparser.BaseParser.parseCOSArray(
      5. org.apache.pdfbox.pdfparser.PDFStreamParser.parseNextToken(
      5 frames
    3. org.apache.pdfbox
      1. org.apache.pdfbox.contentstream.PDFStreamEngine.processStreamOperators(
      2. org.apache.pdfbox.contentstream.PDFStreamEngine.processStream(
      3. org.apache.pdfbox.contentstream.PDFStreamEngine.processPage(
      4. org.apache.pdfbox.text.PDFTextStreamEngine.processPage(
      5. org.apache.pdfbox.text.PDFTextStripper.processPage(
      6. org.apache.pdfbox.text.PDFTextStripper.processPages(
      7. org.apache.pdfbox.text.PDFTextStripper.writeText(
      8. org.apache.pdfbox.text.PDFTextStripper.getText(
      8 frames
    4. main
      1. main.Test.readPDF(
      2. main.Test.main(
      2 frames
    5. Java RT
      1. sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
      2. sun.reflect.NativeMethodAccessorImpl.invoke(
      3. sun.reflect.DelegatingMethodAccessorImpl.invoke(
      4. java.lang.reflect.Method.invoke(
      4 frames
    6. IDEA
      1. com.intellij.rt.execution.application.AppMain.main(
      1 frame