java.lang.RuntimeException: java.io.IOException: Unexpected character a(Expecting d)

JIRA | Michael Stack | 10 years ago
  1. 0

    The following was reported by Olaf Freyer on the mailing list:

    With the release candidate I seem to be unable to use the v10 WARCReader via the console (I have not tested whether it also fails when used from Java). Here is how I usually run the WARCReader (updated to the new package structure). My warcreader shell script basically contains:

        FOREGROUND='true' CLASS_MAIN='org.archive.io.warc.v10.WARCReader' JMX_OFF='off' $HERITRIX_HOME/bin/heritrix

    Now I call:

        sh warcreader -f dump myWARC.warc

    Here is what I get:

        java.lang.ClassCastException: org.archive.io.warc.WARCReaderFactory$UncompressedWARCReader cannot be cast to org.archive.io.warc.v10.WARCReader
            at org.archive.io.warc.v10.WARCReaderFactory.get(WARCReaderFactory.java:61)
            at org.archive.io.warc.v10.WARCReader.main(WARCReader.java:298)

    Also note that I am unable to use the "dump" option of the WARCReader in heritrix-1.10.2 as well. Although it at least starts up, I get the following error:

        Exception processing myWARC.warc: java.io.IOException: Unexpected character a(Expecting d)
        java.lang.RuntimeException: java.io.IOException: Unexpected character a(Expecting d)
            at org.archive.io.ArchiveReader$ArchiveRecordIterator.hasNext(ArchiveReader.java:462)
            at org.archive.io.warc.WARCReader.dump(WARCReader.java:104)
            at org.archive.io.ArchiveReader.output(ArchiveReader.java:627)
            at org.archive.io.warc.WARCReader.output(WARCReader.java:156)
            at org.archive.io.warc.WARCReader.main(WARCReader.java:300)
        Caused by: java.io.IOException: Unexpected character a(Expecting d)
            at org.archive.io.warc.WARCReader.readExpectedChar(WARCReader.java:81)
            at org.archive.io.warc.WARCReader.gotoEOR(WARCReader.java:71)
            at org.archive.io.ArchiveReader.cleanupCurrentRecord(ArchiveReader.java:190)
            at org.archive.io.ArchiveReader$ArchiveRecordIterator.hasNext(ArchiveReader.java:460)
            ... 4 more

    Thanks in advance for any help/advice,
    Olaf Freyer

    JIRA | 10 years ago | Michael Stack
    java.lang.RuntimeException: java.io.IOException: Unexpected character a(Expecting d)
  2. 0

    Dear IA-Team,

    There seems to be yet another issue with WARC files in Heritrix-1.12.0. I'm unable to read non-compressed WARC files with the current release. This happens both when I write non-compressed WARC files directly and when I uncompress compressed WARC files (which I was able to read before uncompressing them).

        sh warcreader -f dump /heritrix/jobs/working3-20070319123718924/warcs/IAH-20070319123730-00002-t5.warc

        {content-type=text/plain, reader-identifier=/heritrix/jobs/working3-20070319123718924/warcs/IAH-20070319123730-00002-t5.warc, absolute-offset=0, subject-uri=urn:uuid:4806edc7-9244-4d70-af1d-1d6ff3ddca75, record-identifier=urn:uuid:4806edc7-9244-4d70-af1d-1d6ff3ddca75, length=216, creation-date=20070319123730, type=warcinfo, Filename=IAH-20070319123730-00002-t5.warc, version=0.10}
        TODO: Unimplemented
        19.03.2007 13:47:27 org.archive.io.ArchiveReader$ArchiveRecordIterator hasNext
        WARNUNG: Trying skip of failed record cleanup of {content-type=text/plain, reader-identifier=/heritrix/jobs/working3-20070319123718924/warcs/IAH-20070319123730-00002-t5.warc, absolute-offset=0, subject-uri=urn:uuid:4806edc7-9244-4d70-af1d-1d6ff3ddca75, record-identifier=urn:uuid:4806edc7-9244-4d70-af1d-1d6ff3ddca75, length=216, creation-date=20070319123730, type=warcinfo, Filename=IAH-20070319123730-00002-t5.warc, version=0.10}: Unexpected character a(Expecting d)
        19.03.2007 13:47:27 org.archive.io.ArchiveReader$ArchiveRecordIterator hasNext
        WARNUNG: Trying skip of failed record cleanup of {content-type=text/plain, reader-identifier=/heritrix/jobs/working3-20070319123718924/warcs/IAH-20070319123730-00002-t5.warc, absolute-offset=0, subject-uri=urn:uuid:4806edc7-9244-4d70-af1d-1d6ff3ddca75, record-identifier=urn:uuid:4806edc7-9244-4d70-af1d-1d6ff3ddca75, length=216, creation-date=20070319123730, type=warcinfo, Filename=IAH-20070319123730-00002-t5.warc, version=0.10}: Unexpected character 41(Expecting d)
        19.03.2007 13:47:27 org.archive.io.ArchiveReader$ArchiveRecordIterator next
        WARNUNG: Bad Record. Trying skip (Current offset 218): Unexpected character 57(Expecting d)
        Exception processing /heritrix/jobs/working3-20070319123718924/warcs/IAH-20070319123730-00002-t5.warc: After retry (Offset 218)
        java.lang.RuntimeException: After retry (Offset 218)
            at org.archive.io.ArchiveReader$ArchiveRecordIterator.next(ArchiveReader.java:529)
            at org.archive.io.ArchiveReader$ArchiveRecordIterator.next(ArchiveReader.java:455)
            at org.archive.io.warc.v10.WARCReader.dump(WARCReader.java:106)
            at org.archive.io.ArchiveReader.output(ArchiveReader.java:649)
            at org.archive.io.warc.v10.WARCReader.output(WARCReader.java:157)
            at org.archive.io.warc.v10.WARCReader.main(WARCReader.java:301)
        Caused by: java.io.IOException: Unexpected character 52(Expecting d)
            at org.archive.io.warc.v10.WARCReader.readExpectedChar(WARCReader.java:82)
            at org.archive.io.warc.v10.WARCReader.gotoEOR(WARCReader.java:72)
            at org.archive.io.ArchiveReader.cleanupCurrentRecord(ArchiveReader.java:192)
            at org.archive.io.ArchiveReader.get(ArchiveReader.java:142)
            at org.archive.io.ArchiveReader$ArchiveRecordIterator.innerNext(ArchiveReader.java:579)
            at org.archive.io.ArchiveReader$ArchiveRecordIterator.exceptionNext(ArchiveReader.java:554)
            at org.archive.io.ArchiveReader$ArchiveRecordIterator.next(ArchiveReader.java:522)
            ... 5 more

    Basically the same issue exists for the v12 WARCReader, too.

    Regards,
    Olaf Freyer

    JIRA | 10 years ago | Olaf Freyer
    java.lang.RuntimeException: After retry (Offset 218)
  5. 0

    Here is my code:

        import java.io.File;
        import java.io.IOException;
        import java.util.Iterator;

        import org.archive.io.ArchiveRecord;
        import org.archive.io.warc.WARCReader;
        import org.archive.io.warc.WARCReaderFactory;

        public class WarcTest {
            public static void main(String[] args) {
                String warcPath = "c:\\temp\\pipeData\\warc\\test\\11.warc";
                WARCReader warcReader;
                try {
                    // Open the WARC file and dump every record in it.
                    warcReader = WARCReaderFactory.get(new File(warcPath));
                    Iterator<ArchiveRecord> it = warcReader.iterator();
                    while (it.hasNext()) {
                        ArchiveRecord record = it.next();
                        record.dump();
                    }
                } catch (IOException e) {
                    e.printStackTrace();
                }
            }
        }

    And here is the WARC file:

        WARC/0.18
        WARC-Type: warcinfo
        WARC-Date: 2009-07-15T06:56:30Z
        WARC-Filename: 1.warc
        WARC-Record-ID: <urn:uuid:f496e1f2-b96c-45f0-9f43-a1385c7b0939>
        Content-Type: application/warc-fields
        Content-Length: 80

        Content-Description: Made from C:\temp\1.arc by org.archive.io.Arc2Warc/5800

        WARC/0.18
        WARC-Type: resource
        WARC-Target-URI: http://vteoria.com/dif/warc_shaul.html
        WARC-Date: 20090715065512
        IP-Address: 72.167.131.216
        WARC-Record-ID: <urn:uuid:63d42fd5-0fae-4a15-81a2-1443e394b449>
        Content-Type: application/http; msgtype=response
        Content-Length: 673

        HTTP/1.1 200 OK
        Date: Wed, 15 Jul 2009 06:55:12 GMT
        Server: Apache
        Connection: close
        Content-Type: text/html

        <HTML>
        <HEAD>
        <TITLE>Your Title Here</TITLE>
        </HEAD>
        <BODY BGCOLOR="FFFFFF">
        <HR>
        <a href="http://somegreatsite.com">Link Name</a> is a link to another nifty site
        <H1>This is a Header</H1>
        <H2>This is a Medium Header</H2>
        Send me mail at <a href="mailto:support@yourcompany.com"> support@yourcompany.com</a>.
        <P> This is a new paragraph!
        <P> <B>This is a new paragraph!</B>
        <BR> <B><I>This is a new sentence without a paragraph break, in bold italics.</I></B>
        <HR>
        </BODY>
        </HTML>

    The errors that I get are:

        Content-Description: Made from C:\temp\1.arc by org.archive.io.Arc2Warc/5800
        Jul 27, 2009 1:17:14 PM org.archive.io.ArchiveReader$ArchiveRecordIterator hasNext
        WARNING: Trying skip of failed record cleanup of {WARC-Type=warcinfo, WARC-Filename=1.warc, reader-identifier=c:\temp\pipeData\warc\test\11.warc, WARC-Date=2009-07-15T06:56:30Z, absolute-offset=0, Content-Length=80, WARC-Record-ID=<urn:uuid:f496e1f2-b96c-45f0-9f43-a1385c7b0939>, Content-Type=application/warc-fields}: Unexpected character a(Expecting d)
        Jul 27, 2009 1:17:14 PM org.archive.io.ArchiveReader$ArchiveRecordIterator hasNext
        WARNING: Trying skip of failed record cleanup of {WARC-Type=warcinfo, WARC-Filename=1.warc, reader-identifier=c:\temp\pipeData\warc\test\11.warc, WARC-Date=2009-07-15T06:56:30Z, absolute-offset=0, Content-Length=80, WARC-Record-ID=<urn:uuid:f496e1f2-b96c-45f0-9f43-a1385c7b0939>, Content-Type=application/warc-fields}: Unexpected character 41(Expecting d)
        Jul 27, 2009 1:17:14 PM org.archive.io.ArchiveReader$ArchiveRecordIterator next
        WARNING: Bad Record. Trying skip (Current offset 296): Unexpected character 57(Expecting d)
        Exception in thread "main" java.lang.RuntimeException: After retry (Offset 296)
            at org.archive.io.ArchiveReader$ArchiveRecordIterator.next(ArchiveReader.java:535)
            at org.archive.io.ArchiveReader$ArchiveRecordIterator.next(ArchiveReader.java:461)
            at example.WarcTest.main(WarcTest.java:23)
        Caused by: java.io.IOException: Unexpected character 52(Expecting d)
            at org.archive.io.warc.WARCReader.readExpectedChar(WARCReader.java:82)
            at org.archive.io.warc.WARCReader.gotoEOR(WARCReader.java:72)
            at org.archive.io.ArchiveReader.cleanupCurrentRecord(ArchiveReader.java:192)
            at org.archive.io.ArchiveReader.get(ArchiveReader.java:142)
            at org.archive.io.ArchiveReader$ArchiveRecordIterator.innerNext(ArchiveReader.java:585)
            at org.archive.io.ArchiveReader$ArchiveRecordIterator.exceptionNext(ArchiveReader.java:560)
            at org.archive.io.ArchiveReader$ArchiveRecordIterator.next(ArchiveReader.java:528)
            ... 2 more

    I think I solved this by modifying the class ArchiveRecord and changing the "read" method to:

        public int read(byte[] b, int offset, int length) throws IOException {
            int read = Math.min(length, available());
            if (read != -1 && read != 0) {
                read = this.in.read(b, offset, read);
                if (read == -1) {
                    String msg = "Premature EOF before end-of-record: " + getHeader().getHeaderFields();
                    if (isStrict()) {
                        throw new IOException(msg);
                    }
                    setEor(true);
                    System.err.println(Level.WARNING.toString() + " " + msg);
                }
                if (this.digest != null && read >= 0) {
                    this.digest.update(b, offset, read);
                }
            }
            /*
             * Shaul K. set the read to -1 only after the actual increment is done.
             */
            incrementPosition(read);
            if (read == -1 || read == 0) {
                read = -1;
            }
            return read;
        }

    Can you verify? Otherwise, what am I doing wrong?

    JIRA | 7 years ago | Shaul Kushelevsky
    java.lang.RuntimeException: After retry (Offset 296)
  6. 0

    GitHub comment 17#71220467

    GitHub | 2 years ago | machawk1
    java.lang.RuntimeException: After retry (Offset 94220)
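
    The reports above all show messages of the form "Unexpected character X(Expecting Y)". A minimal sketch for reading them, assuming (this is not stated in the reports) that X and Y are hexadecimal byte values: under that reading, a is LF, d is CR, and 41, 57 and 52 are the letters 'A', 'W' and 'R', which would be consistent with the reader running into the "WARC/..." line of the next record while looking for a carriage return. The class and method names below are hypothetical helpers, not part of Heritrix.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class UnexpectedCharDecoder {
    // Matches messages such as "Unexpected character 41(Expecting d)".
    // Assumption: both numbers are hexadecimal byte values.
    private static final Pattern MSG = Pattern.compile(
            "Unexpected character ([0-9a-fA-F]+)\\(Expecting ([0-9a-fA-F]+)\\)");

    // Turns a byte value into a human-readable name.
    static String describe(int code) {
        if (code == 0x0a) {
            return "LF";
        }
        if (code == 0x0d) {
            return "CR";
        }
        if (code >= 0x20 && code < 0x7f) {
            return "'" + (char) code + "'";
        }
        return String.format("0x%02x", code);
    }

    // Returns a readable summary, or null if the message has another shape.
    static String decode(String message) {
        Matcher m = MSG.matcher(message);
        if (!m.find()) {
            return null;
        }
        int got = Integer.parseInt(m.group(1), 16);
        int expected = Integer.parseInt(m.group(2), 16);
        return "got " + describe(got) + ", expected " + describe(expected);
    }

    public static void main(String[] args) {
        String[] samples = {
            "Unexpected character a(Expecting d)",
            "Unexpected character 41(Expecting d)",
            "Unexpected character 57(Expecting d)",
            "Unexpected character 52(Expecting d)",
        };
        for (String s : samples) {
            System.out.println(s + " -> " + decode(s));
        }
    }
}
```

    If the hexadecimal assumption holds, the first message in each report says that a line feed was read where a carriage return was expected.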

    Root Cause Analysis

    1. java.io.IOException

      Unexpected character a(Expecting d)

      at org.archive.io.warc.WARCReader.readExpectedChar()
    2. webarchive-commons
      WARCReader.main
      1. org.archive.io.warc.WARCReader.readExpectedChar(WARCReader.java:81)
      2. org.archive.io.warc.WARCReader.gotoEOR(WARCReader.java:71)
      3. org.archive.io.ArchiveReader.cleanupCurrentRecord(ArchiveReader.java:190)
      4. org.archive.io.ArchiveReader$ArchiveRecordIterator.hasNext(ArchiveReader.java:460)
      5. org.archive.io.warc.WARCReader.dump(WARCReader.java:104)
      6. org.archive.io.ArchiveReader.output(ArchiveReader.java:627)
      7. org.archive.io.warc.WARCReader.output(WARCReader.java:156)
      8. org.archive.io.warc.WARCReader.main(WARCReader.java:300)
      8 frames
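
    The trace above fails in readExpectedChar, called from gotoEOR, with "Unexpected character a(Expecting d)". One way to investigate, sketched here under the assumption that a and d are the hexadecimal codes for LF and CR, is to check whether the WARC file contains line feeds that are not preceded by a carriage return. BareLfScanner and firstBareLf are hypothetical names, and the sketch uses only the Java standard library, not the Heritrix API. A hit inside a record payload (an HTML body, for example) may be harmless, so the reported offset is only a hint about where to look.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

public class BareLfScanner {
    // Returns the offset of the first LF (0x0a) that is not directly
    // preceded by CR (0x0d), or -1 if every LF is part of a CRLF pair.
    static long firstBareLf(byte[] data) {
        for (int i = 0; i < data.length; i++) {
            boolean precededByCr = i > 0 && data[i - 1] == '\r';
            if (data[i] == '\n' && !precededByCr) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java BareLfScanner <uncompressed-warc-file>");
            return;
        }
        // Reads the whole file into memory, so this is only suitable
        // for small, uncompressed test files.
        byte[] data = Files.readAllBytes(Paths.get(args[0]));
        long offset = firstBareLf(data);
        System.out.println(offset < 0
                ? "no bare LF found"
                : "first bare LF at offset " + offset);
    }
}
```

    Comparing the reported offset with the offsets in the log messages (218 and 296 in the reports above) may show whether the record boundary is where the line endings differ.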