java.io.IOException: /home/brad/projects/funky-arc/EG_11_0_crawl31_.20060109053503.arc.gz is not an Internet Archive ARC file.

JIRA | Brad Tofel | 8 years ago
  1. 0

    Karl was running jobs looking at issue where we OOME if multiple jobs are run in series. Concurrently he was reviewing arc validation tools. He found the following problem where an ARC had a record that had nothing but the ARC Record meta data line -- and it was not the last record in the ARC. He was crawling against infiniteurl so records are small and things are running faster than usual. Here are the records at the end of the ARC. A bunch got written in a flurry of activity -- all at the same moment: 20050513213319 127.0.0.1 http://14-31-50-738.0.localhost:8888/infiniteurl/infinity/0-14-31-56-117.ht ml text/html - LC2HUXXNPG5UB43RSDDB6WISSMHCFFW3 80593 1767 bad-arc WARNING Premature EOF before end-of-record: {archive-version=1.1, ip-address=127.0.0.1, archive-offset=81290, url=http://14-32-32-405.0.localhost:8888/infiniteurl/infinity/7-14-32-34-89 3.html, archive-date=20050513213319, content-type=text/html, archive-length=1767} 20050513213319 127.0.0.1 http://14-32-32-405.0.localhost:8888/infiniteurl/infinity/7-14-32-34-893.ht ml text/html - 3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ 81290 1767 bad-arc 20050513213319 127.0.0.1 http://1.localhost:8888/infiniteurl/infinity/7-14-31-35-13.html text/html - GI6CLSVDJEIANAEB7FS4J37OSXIOXPXY 81404 1753 bad-arc 20050513213319 127.0.0.1 http://14-32-11-415.0.localhost:8888/infiniteurl/infinity/1-14-32-15-690.ht ml text/html - S5TXPU5DLDNJ546WSFAKVRMGV23C6HBL 82095 1767 bad-arc 20050513213319 127.0.0.1 http://14-33-03-160.0.localhost:8888/infiniteurl/infinity/9.html text/html - ROFJQPYHC35DSBP5ULDOBHOE3SS7HY35 82793 1754 bad-arc 20050513213319 127.0.0.1 http://14-31-40-393.0.localhost:8888/infiniteurl/infinity/4-14-31-48-864.ht ml text/html - 3BBTYOAVFVHNJF4KGBNO7EUFSTMSOCLJ 83484 1767 bad-arc 20050513213319 127.0.0.1 http://14-31-30-333.0.localhost:8888/infiniteurl/infinity/0-14-31-35-563.ht ml text/html - ZA73PPXZT6IY5C27QSDOQ3RI5ZQHZW7N 84181 1767 bad-arc 20050513213319 127.0.0.1 http://14-32-01-60.1.localhost:8888/infiniteurl/infinity/4-14-32-06-221.htm l text/html - 2WMTZF7EKMM6CTTF33TIP4CC4QGSMV5G 84877 1766 bad-arc 20050513213319 127.0.0.1 http://14-32-52-924.1.localhost:8888/infiniteurl/infinity/2-14-32-54-731.ht ml text/html - 7IDQKMFTLYIOLVVGJ5Y2YDYNYCGGGGAM 85575 1767 bad-arc The second record in the above has a warning about its inconsistent length. Here is how things look when we zcat: .... <a href="/infiniteurl/infinity/7-14-33-19-222.html">7</a>, <a href="/infiniteurl/infinity/8-14-33-19-222.html">8</a>, <a href="/infiniteurl/infinity/9-14-33-19-222.html">9</a> </body> </html> http://14-32-32-405.0.localhost:8888/infiniteurl/infinity/7-14-32-34-893.ht ml 127.0.0.1 20050513213319 text/html 1767 http://1.localhost:8888/infiniteurl/infinity/7-14-31-35-13.html 127.0.0.1 20050513213319 text/html 1753 HTTP/1.1 200 OK^M Date: Fri, 13 May 2005 21:33:19 GMT^M Server: Jetty/4.2.19 (Linux/2.4.27 i386 java/1.5.0_01)^M Connection: close^M Content-Type: text/html;charset=ISO-8859-1^M ^M ... Karl had the old logs around and we found following in Heritrix out: Followup: lines 533-542 of labcrawl04:/1/webcrawl-test/1/webcrawl-test/2-heritrix/heritrix-1.5.0-20050 5122111/heritrix_out.log: <<< java.lang.OutOfMemoryError: Java heap space #28 http://14-32-32-405.0.localhost:8888/infiniteurl/infinity/7-14-32-34-893.ht ml (0 attempts) LLLLLL http://14-32-32-405.0.localhost:8888/infiniteurl/infinity/1.html Current processor: Archiver ACTIVE for 2s441ms Where: ABOUT_TO_BEGIN_PROCESSOR for 2430ms java.lang.OutOfMemoryError: Java heap space >>>>>> Exception in thread "StartNextJob" java.lang.OutOfMemoryError: Java heap space From Karl, "So it looks like this record was aborted just after the header line was written." The writing of such incomplete records should never happen, not even in OOME situations. This issue is about adding to arc writer the handling of OOMEs (At least, the file should be marked invalid). Below are original notes from Karl Karl Thiessen <karl@archive.org> writes: >> In my testing, I found a bad ARC record created with a header line but no data at all. Shell excerpt fillows, with commentary after: >> >> === begin shell excerpt == >> [1] 10:39:24 (pts/5) karl@labcrawl04:/home/webcrawl/tests/BAD-ARCS 1026$ sh $HERITRIX_HOME/bin/arcreader -s bad-arc.gz > /dev/null >> Exception in thread "main" java.io.IOException: Premature EOF before end-of-record: {archive-version=1.1, ip-address=127.0.0.1, archive-offset=81290, url=http://14-32-32-405.0.localhost:8888/infiniteurl/infinity/7-14-32-34-89 3.html, archive-date=20050513213319, content-type=text/html, archive-length=1767} >> at org.archive.io.arc.ARCRecord.read(ARCRecord.java:416) >> at org.archive.io.arc.ARCRecord.skip(ARCRecord.java:450) >> at org.archive.io.arc.ARCRecord.skip(ARCRecord.java:474) >> at org.archive.io.arc.ARCRecord.close(ARCRecord.java:352) >> at org.archive.io.arc.ARCReader.outputARCRecordCdx(ARCReader.java:989) >> at org.archive.io.arc.ARCReader.cdxOutput(ARCReader.java:918) >> at org.archive.io.arc.ARCReader.output(ARCReader.java:841) >> at org.archive.io.arc.ARCReader.main(ARCReader.java:1169) >> >> [1] 10:39:32 (pts/5) karl@labcrawl04:/home/webcrawl/tests/BAD-ARCS 1027$ /home/brad/projects/web/validate-arc-files bad-arc.gz >> EVENT 1116610793 labcrawl04.archive.org bad-arc.gz ShortRecord coff 81290 http://14-32-32-405.0.localhost:8888/infiniteurl/infinity/7-14-32-34-893.ht ml (http://14-32-32-405.0.localhost:8888/infiniteurl/infinity/7-14-32-34-893.h tml 127.0.0.1 20050513213319 text/html 1767) want(1768) got(0) >> >> [1] 10:39:53 (pts/5) karl@labcrawl04:/home/webcrawl/tests/BAD-ARCS 1028$ ls -l IAH-20050513213215-00004-labcrawl04.archive.org.arc.gz >> lrwxrwxrwx 1 karl webcrawl 180 May 20 10:35 IAH-20050513213215-00004-labcrawl04.archive.org.arc.gz -> /1/webcrawl-test/1-crawldata/test-results/heritrix-1.5.0-200505122111-20050 513T141347/t1123230/default-20050513212748787/arcs/IAH-20050513213215-00004 -labcrawl04.archive.org.arc.gz >> >> [0] 10:40:06 (pts/5) karl@labcrawl04:/home/webcrawl/tests/BAD-ARCS 1029$ md5sum IAH-20050513213215-00004-labcrawl04.archive.org.arc.gz bad-arc.gz >> 571e41cc1c0f90e7a2f9203372ad662a IAH-20050513213215-00004-labcrawl04.archive.org.arc.gz >> 571e41cc1c0f90e7a2f9203372ad662a bad-arc.gz >> >> [0] 10:40:44 (pts/5) karl@labcrawl04:/home/webcrawl/tests/BAD-ARCS 1030$ zgrep -A 10 -E '^http://14-32-32-405.0.localhost:8888/infiniteurl/infinity/7-14-32-34-893. html' bad-arc.gz >> http://14-32-32-405.0.localhost:8888/infiniteurl/infinity/7-14-32-34-893.ht ml 127.0.0.1 20050513213319 text/html 1767 >> http://1.localhost:8888/infiniteurl/infinity/7-14-31-35-13.html 127.0.0.1 20050513213319 text/html 1753 >> HTTP/1.1 200 OK >> Date: Fri, 13 May 2005 21:33:19 GMT >> Server: Jetty/4.2.19 (Linux/2.4.27 i386 java/1.5.0_01) >> Connection: close >> Content-Type: text/html;charset=ISO-8859-1 >> >> >> >> >> [0] 10:45:16 (pts/5) karl@labcrawl04:/home/webcrawl/tests/BAD-ARCS 1031$ >> >> === end shell excerpt === >> >> >> The thing to notice is the two lines beginning http:// right on top of >> one another in the zgrep output. The top one is a perfectly fine >> header line, with presumably 1767 bytes opf data to follow, but the >> data isn't there. It is emphatically *not* the last record in the arc >> file; the beginning of the next record is clearly visible. Doesn't >> look liek this was tthe result of a job termination or crawler kill. >> >> Any ideas what would cause this? Could it be part of the pattern that >> leads to our high-prority bug, [ 1116456 ] ARCWriter length is wrong >> (But coherent gzip record) ? >> >> Thanks for any input, >> --K.

    JIRA | 10 years ago | Michael Stack
    java.io.IOException: Premature EOF before end-of-record: {archive-version=1.1, ip-address=127.0.0.1, archive-offset=81290, url=http://14-32-32-405.0.localhost:8888/infiniteurl/infinity/7-14-32-34-89 3.html, archive-date=20050513213319, content-type=text/html, archive-length=1767}
  2. Speed up your debug routine!

    Automated exception search integrated into your IDE

  3. 0

    Plugin isn't compatible with Custom Tools

    GitHub | 2 years ago | joshuaspence
    java.io.IOException: Cannot run program "arc" (in directory "/mnt/jenkins-slave-workspace/workspace/redacted"): error=2, No such file or directory
  4. 0

    Jenkins is unable to run arc

    Stack Overflow | 11 months ago | PhD
    java.io.IOException: Cannot run program "arc" (in directory "C:\Users\Jenkins\workspace\windows_box_testing"): CreateProcess error=2, The system cannot find the file specified

    Not finding the right solution?
    Take a tour to get the most out of Samebug.

    Tired of useless tips?

    Automated exception search integrated into your IDE

    Root Cause Analysis

    1. java.io.IOException

      /home/brad/projects/funky-arc/EG_11_0_crawl31_.20060109053503.arc.gz is not an Internet Archive ARC file.

      at org.archive.io.arc.ARCReaderFactory.getArchiveReader()
    2. webarchive-commons
      ARCReader.main
      1. org.archive.io.arc.ARCReaderFactory.getArchiveReader(ARCReaderFactory.java:117)
      2. org.archive.io.arc.ARCReaderFactory.get(ARCReaderFactory.java:106)
      3. org.archive.io.ArchiveReaderFactory.getArchiveReader(ArchiveReaderFactory.java:133)
      4. org.archive.io.ArchiveReaderFactory.get(ArchiveReaderFactory.java:126)
      5. org.archive.io.ArchiveReaderFactory.getArchiveReader(ArchiveReaderFactory.java:99)
      6. org.archive.io.ArchiveReaderFactory.getArchiveReader(ArchiveReaderFactory.java:93)
      7. org.archive.io.arc.ARCReaderFactory.get(ARCReaderFactory.java:69)
      8. org.archive.io.arc.ARCReader.main(ARCReader.java:534)
      8 frames