org.apache.commons.httpclient.HttpRecoverableException: Error in parsing the status line from the response: unable to find line starting with "HTTP"

JIRA | Michael Stack | 1 decade ago

    From Tom Emerson:

    I started a crawl over 4 (Simplified) Chinese sites. The configuration was set to use 15 toe threads, domain scope, with Accept-Language: zh-CN, zh-SG, zh, en and a default-encoding of CP936. The frontier configuration is:

        <newObject name="frontier" class="org.archive.crawler.frontier.Frontier">
          <float name="delay-factor">2.5</float>
          <integer name="max-delay-ms">2500</integer>
          <integer name="min-delay-ms">250</integer>
          <integer name="max-retries">10</integer>
          <long name="retry-delay-seconds">360</long>
          <boolean name="hold-queues">false</boolean>
          <integer name="host-valence">2</integer>
          <integer name="total-bandwidth-usage-KB-sec">0</integer>
          <integer name="max-per-host-bandwidth-usage-KB-sec">0</integer>
          <integer name="host-queues-memory-capacity">200</integer>
        </newObject>

    All four seeds were crawled successfully. However, the console showed that things appeared to stop after 123 documents were retrieved. Attempts to view the thread log failed: the server would just hang, though I was able to get back to the console. Then I noticed that the heritrix_out.log file was growing very fast: when it hit 4.4 gigabytes I shut down Heritrix from the WUI. Unfortunately it kept growing: a zombie Heritrix process was still running and had to be killed from the command line.
    The last several hundred thousand lines of the log file look like the following excerpt:

        KeyedQueue www.dayoo.com Length: 43 Status: READY Last enqueued: http://www.dayoo.com/gb/content/2004-07/21/content_1640649.htm Last dequeued: http://www.dayoo.com/img/gzzx.swf
        KeyedQueue www.dayoo.com Length: 43 Status: READY Last enqueued: http://www.dayoo.com/gb/content/2004-07/21/content_1640649.htm Last dequeued: http://www.dayoo.com/img/gzzx.swf
        KeyedQueue www.dayoo.com Length: 43 Status: READY Last enqueued: http://www.dayoo.com/gb/content/2004-07/21/content_1640649.htm Last dequeued: http://www.dayoo.com/img/gzzx.swf
        KeyedQueue www.dayoo.com Length: 43 Status: READY Last enqueued: http://www.dayoo.com/gb/content/2004-07/21/content_1640649.htm Last dequeued: http://www.dayoo.com/img/gzzx.swf

    And then switches to

        Snooze queues size: 25
        KeyedQueue www.dayoo.com Length: 43 Status: READY Last enqueued: http://www.dayoo.com/gb/content/2004-07/21/content_1640649.htm Last dequeued: http://www.dayoo.com/img/gzzx.swf
        KeyedQueue modern.dayoo.com Length: 1 Status: SNOOZED Wakes in: -1172829ms Last enqueued: http://modern.dayoo.com/ Last dequeued: dns:modern.dayoo.com
        KeyedQueue partner.cpc.sohu.com Length: 1 Status: SNOOZED Wakes in: -1172817ms Last enqueued: http://partner.cpc.sohu.com/cpc/partner_iframe.php?sid=241&pid=wwwbjd&type=24 Last dequeued: dns:partner.cpc.sohu.com
        KeyedQueue informationtimes.dayoo.com Length: 9 Status: SNOOZED Wakes in: -1172749ms Last enqueued: http://informationtimes.dayoo.com/gb/content/2004-07/20/content_1639599.htm Last dequeued: dns:informationtimes.dayoo.com

    And then eventually ends with

        Jul 20, 2004 5:50:13 PM org.archive.crawler.frontier.Frontier wakeReadyQueues
        SEVERE: first() item couldn't be remove()d! - org.archive.crawler.frontier.KeyedQueue@501268 - false

    which may or may not be related to hitting this with a kill -9.

    My local errors log shows:

        20040720213038762 -2 . #2 http://c3.thecounter.com/robots.txt . . . EP http://c3.thecounter.com/id=2424977
        org.apache.commons.httpclient.HttpRecoverableException: org.apache.commons.httpclient.HttpRecoverableException: Error in parsing the status line from the response: unable to find line starting with "HTTP"
            at org.apache.commons.httpclient.HttpMethodBase.readResponse(HttpMethodBase.java:1965)
            at org.apache.commons.httpclient.HttpMethodBase.processRequest(HttpMethodBase.java:2659)
            at org.apache.commons.httpclient.HttpMethodBase.execute(HttpMethodBase.java:1093)
            at org.archive.httpclient.HttpRecorderGetMethod.execute(HttpRecorderGetMethod.java:157)
            at org.archive.httpclient.PatchedHttpClient.executeMethod(PatchedHttpClient.java:294)
            at org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:529)
            at org.archive.crawler.fetcher.FetchHTTP.innerProcess(FetchHTTP.java:231)
            at org.archive.crawler.framework.Processor.process(Processor.java:106)
            at org.archive.crawler.framework.ToeThread.processCrawlUri(ToeThread.java:255)
            at org.archive.crawler.framework.ToeThread.run(ToeThread.java:138)

    uri-errors.log and runtime-errors.log are empty. This is running on a sync against HEAD yesterday (Tuesday) afternoon. I'm willing to be a guinea pig here if people have suggestions.

    Thanks. -tree

    P.S. I'm currently bzip2ing the heritrix_out.log file --- I'll keep it around this time. I'm going to try some other Chinese sites and see if the same thing happens.

    From Gordon:

    Thanks for the report; I presume you're using very recent CVS HEAD code? Looks like some debugging output put in to illuminate a believed-fixed bug -- the logger code in Frontier.wakeReadyQueues() -- is again being triggered, endlessly. I suspect an item inside the snoozeQueues is mutating when it shouldn't, leading to an inconsistent TreeSet where snoozeQueues.contains(snoozeQueues.first()) is false, and a spin over that debugging output. I see you're running with a non-default 'valence'; that's probably the trigger.
    Valence values higher than 1 have only been lightly tested and aren't regularly used at IA. Previous bugs in the code that allows multiple in-process URIs from the same host-queue have caused queues to live in two places where they should only be in one, which could cause this kind of TreeSet sort-invariant corruption.

    Please file a bug, especially if your initial settings and seeds trigger the problem every time. It would also be interesting to try triggering the bug with assertions enabled ('-ea' VM arg), because that may catch the root cause earlier. I suspect it can't be reproduced at all with valence = 1.

    - Gordon @ IA

    I just tried it with order.xml and seeds supplied by Tom and reproduced the problem. See attached.
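
    Gordon's hypothesis, that an element's sort key changes while the element still sits in a TreeSet, can be illustrated with a minimal sketch. Everything named here (SnoozeSetDemo, HostQueue, wakeTime, BY_WAKE_TIME) is a hypothetical stand-in, not Heritrix's KeyedQueue or Frontier code; it only shows how such a mutation can make contains(first()) return false, which is the condition named in the thread.

```java
import java.util.Comparator;
import java.util.TreeSet;

class SnoozeSetDemo {
    /** Hypothetical stand-in for a per-host queue; not Heritrix's KeyedQueue. */
    static final class HostQueue {
        final String host;
        long wakeTime; // mutable sort key: the source of the trouble

        HostQueue(String host, long wakeTime) {
            this.host = host;
            this.wakeTime = wakeTime;
        }
    }

    /** Orders queues by wake time, breaking ties by host name. */
    static final Comparator<HostQueue> BY_WAKE_TIME = (x, y) -> {
        int c = Long.compare(x.wakeTime, y.wakeTime);
        return c != 0 ? c : x.host.compareTo(y.host);
    };

    /** Changes the sort key while the element is still in the set. */
    static boolean mutateInPlace() {
        TreeSet<HostQueue> snoozed = new TreeSet<>(BY_WAKE_TIME);
        HostQueue a = new HostQueue("a.example", 10);
        snoozed.add(new HostQueue("b.example", 20));
        snoozed.add(a);
        snoozed.add(new HostQueue("c.example", 30));
        a.wakeTime = 100; // the element's position in the tree is now stale
        // first() walks to the leftmost node (still 'a'), but the lookup
        // compares by the new key and searches the wrong branch.
        return snoozed.contains(snoozed.first());
    }

    /** Removes the element, changes the key, then re-inserts it. */
    static boolean removeMutateReinsert() {
        TreeSet<HostQueue> snoozed = new TreeSet<>(BY_WAKE_TIME);
        HostQueue a = new HostQueue("a.example", 10);
        snoozed.add(new HostQueue("b.example", 20));
        snoozed.add(a);
        snoozed.add(new HostQueue("c.example", 30));
        snoozed.remove(a); // remove under the old key
        a.wakeTime = 100;
        snoozed.add(a);    // re-insert under the new key
        return snoozed.contains(snoozed.first());
    }

    public static void main(String[] args) {
        System.out.println(mutateInPlace());        // false
        System.out.println(removeMutateReinsert()); // true
    }
}
```

    In the first method remove(first()) would fail for the same reason the lookup does, which is consistent with the "first() item couldn't be remove()d!" log line, though whether this is the actual root cause in Frontier is not established in the thread.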


    Root Cause Analysis

    1. org.apache.commons.httpclient.HttpRecoverableException

      Error in parsing the status line from the response: unable to find line starting with "HTTP"

      at org.apache.commons.httpclient.HttpMethodBase.readResponse()
    2. HttpClient
      HttpMethodBase.execute
      1. org.apache.commons.httpclient.HttpMethodBase.readResponse(HttpMethodBase.java:1965)
      2. org.apache.commons.httpclient.HttpMethodBase.processRequest(HttpMethodBase.java:2659)
      3. org.apache.commons.httpclient.HttpMethodBase.execute(HttpMethodBase.java:1093)
      3 frames
    3. webarchive-commons
      PatchedHttpClient.executeMethod
      1. org.archive.httpclient.HttpRecorderGetMethod.execute(HttpRecorderGetMethod.java:157)
      2. org.archive.httpclient.PatchedHttpClient.executeMethod(PatchedHttpClient.java:294)
      2 frames
    4. HttpClient
      HttpClient.executeMethod
      1. org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:529)
      1 frame
    5. org.archive.crawler
      ToeThread.run
      1. org.archive.crawler.fetcher.FetchHTTP.innerProcess(FetchHTTP.java:231)
      2. org.archive.crawler.framework.Processor.process(Processor.java:106)
      3. org.archive.crawler.framework.ToeThread.processCrawlUri(ToeThread.java:255)
      4. org.archive.crawler.framework.ToeThread.run(ToeThread.java:138)
      4 frames
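
    The exception message says no line beginning with "HTTP" was found in the server's response. As a rough illustration of that kind of check, here is a minimal sketch using only the JDK. It is not the Commons HttpClient implementation: StatusLineDemo and findStatusLine are hypothetical names, and the unchecked IllegalStateException stands in for HttpRecoverableException.

```java
class StatusLineDemo {
    /**
     * Returns the first line of the raw response that begins with "HTTP".
     * Throws if no such line exists, mirroring the message in the trace.
     */
    static String findStatusLine(String rawResponse) {
        for (String line : rawResponse.split("\r?\n")) {
            if (line.startsWith("HTTP")) {
                return line;
            }
        }
        throw new IllegalStateException(
                "unable to find line starting with \"HTTP\"");
    }

    public static void main(String[] args) {
        // A well-formed response: the status line is found.
        System.out.println(findStatusLine("HTTP/1.1 200 OK\r\nServer: x\r\n\r\n"));
        // A response with no status line at all: the check fails.
        try {
            findStatusLine("<html>not an HTTP response</html>");
        } catch (IllegalStateException e) {
            System.out.println("failed: " + e.getMessage());
        }
    }
}
```

    A server that answers with something other than an HTTP response, or closes the connection before sending a status line, would fail a check of this kind.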