org.apache.commons.httpclient.HttpRecoverableException: Error in parsing the status line from the response: unable to find line starting with "HTTP"

JIRA | Michael Stack | 10 years ago
  1. 0

    From Tom Emerson:

    I started a crawl over 4 (Simplified) Chinese sites. The configuration was set to use 15 toe threads, domain scope, with Accept-Language: zh-CN, zh-SG, zh, en and a default-encoding of CP936. The frontier configuration is:

        <newObject name="frontier" class="org.archive.crawler.frontier.Frontier">
          <float name="delay-factor">2.5</float>
          <integer name="max-delay-ms">2500</integer>
          <integer name="min-delay-ms">250</integer>
          <integer name="max-retries">10</integer>
          <long name="retry-delay-seconds">360</long>
          <boolean name="hold-queues">false</boolean>
          <integer name="host-valence">2</integer>
          <integer name="total-bandwidth-usage-KB-sec">0</integer>
          <integer name="max-per-host-bandwidth-usage-KB-sec">0</integer>
          <integer name="host-queues-memory-capacity">200</integer>
        </newObject>

    All four seeds were crawled successfully. However, the console showed that things appeared to stop after 123 documents were retrieved. Attempts to view the thread log failed: the server would just hang, though I was able to get back to the console. Then I noticed that the heritrix_out.log file was growing very fast: when it hit 4.4 gigabytes I shutdown Heritrix from the WUI. Unfortunately it kept growing: a zombie Heritrix process was still running and had to be killed from the command-line.

    The last several hundred thousand lines of the log file look like the following excerpt:

        KeyedQueue www.dayoo.com
          Length: 43
          Status: READY
          Last enqueued: http://www.dayoo.com/gb/content/2004-07/21/content_1640649.htm
          Last dequeued: http://www.dayoo.com/img/gzzx.swf
        KeyedQueue www.dayoo.com
          Length: 43
          Status: READY
          Last enqueued: http://www.dayoo.com/gb/content/2004-07/21/content_1640649.htm
          Last dequeued: http://www.dayoo.com/img/gzzx.swf
        KeyedQueue www.dayoo.com
          Length: 43
          Status: READY
          Last enqueued: http://www.dayoo.com/gb/content/2004-07/21/content_1640649.htm
          Last dequeued: http://www.dayoo.com/img/gzzx.swf
        KeyedQueue www.dayoo.com
          Length: 43
          Status: READY
          Last enqueued: http://www.dayoo.com/gb/content/2004-07/21/content_1640649.htm
          Last dequeued: http://www.dayoo.com/img/gzzx.swf

    And then switches to

        Snooze queues size: 25
        KeyedQueue www.dayoo.com
          Length: 43
          Status: READY
          Last enqueued: http://www.dayoo.com/gb/content/2004-07/21/content_1640649.htm
          Last dequeued: http://www.dayoo.com/img/gzzx.swf
        KeyedQueue modern.dayoo.com
          Length: 1
          Status: SNOOZED
          Wakes in: -1172829ms
          Last enqueued: http://modern.dayoo.com/
          Last dequeued: dns:modern.dayoo.com
        KeyedQueue partner.cpc.sohu.com
          Length: 1
          Status: SNOOZED
          Wakes in: -1172817ms
          Last enqueued: http://partner.cpc.sohu.com/cpc/partner_iframe.php?sid=241&pid=wwwbjd&type= 24
          Last dequeued: dns:partner.cpc.sohu.com
        KeyedQueue informationtimes.dayoo.com
          Length: 9
          Status: SNOOZED
          Wakes in: -1172749ms
          Last enqueued: http://informationtimes.dayoo.com/gb/content/2004-07/20/content_1639599.htm
          Last dequeued: dns:informationtimes.dayoo.com

    And then eventually ends with

        Jul 20, 2004 5:50:13 PM org.archive.crawler.frontier.Frontier wakeReadyQueues
        SEVERE: first() item couldn't be remove()d! - org.archive.crawler.frontier.KeyedQueue@501268 - false

    which may or may not be related to hitting this with a kill -9. My local errors log shows:

        20040720213038762 -2 . #2 http://c3.thecounter.com/robots.txt . . . EP http://c3.thecounter.com/id=2424977
        org.apache.commons.httpclient.HttpRecoverableException: org.apache.commons.httpclient.HttpRecoverableException: Error in parsing the status line from the response: unable to find line starting with "HTTP"
          at org.apache.commons.httpclient.HttpMethodBase.readResponse(HttpMethodBase.java:1965)
          at org.apache.commons.httpclient.HttpMethodBase.processRequest(HttpMethodBase.java:2659)
          at org.apache.commons.httpclient.HttpMethodBase.execute(HttpMethodBase.java:1093)
          at org.archive.httpclient.HttpRecorderGetMethod.execute(HttpRecorderGetMethod.java:157)
          at org.archive.httpclient.PatchedHttpClient.executeMethod(PatchedHttpClient.java:294)
          at org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:529)
          at org.archive.crawler.fetcher.FetchHTTP.innerProcess(FetchHTTP.java:231)
          at org.archive.crawler.framework.Processor.process(Processor.java:106)
          at org.archive.crawler.framework.ToeThread.processCrawlUri(ToeThread.java:255)
          at org.archive.crawler.framework.ToeThread.run(ToeThread.java:138)

    uri-errors.log and runtime-errors.log are empty. This is running on a sync against HEAD yesterday (Tuesday) afternoon. I'm willing to be a guinea pig here if people have suggestions. Thanks.

    -tree

    P.S. I'm currently bzip2ing the heritrix_out.log file --- I'll keep it around this time. I'm going to try some other Chinese sites and see if the same thing happens.

    From Gordon:

    Thanks for the report; I presume you're using very recent CVS HEAD code? Looks like some debugging output put in to illuminate a believed-fixed bug -- the logger code in Frontier.wakeReadyQueues() -- is again being triggered, endlessly. I suspect an item inside the snoozeQueues is mutating when it shouldn't, leading to an inconsistent TreeSet where snoozeQueues.contains(snoozeQueues.first()) is false, and a spin over that debugging output. I see you're running with a non-default 'valence'; that's probably the trigger.
    Valence values higher than 1 have only been lightly tested and aren't regularly used at IA. Previous bugs in the code which allows multiple in-process URIs from the same host-queue have caused queues to live in two places where they should only be in one, which could cause this kind of TreeSet sort-invariant corruption.

    Please file a bug, especially if your initial settings and seeds trigger the problem every time. It would also be interesting to try triggering the bug with assertions enabled ('-ea' VM arg), because that may catch the root cause earlier. I suspect it can't be reproduced at all with valence = 1.

    - Gordon @ IA

    I just tried it with order.xml and seeds supplied by Tom and reproduced the problem. See attached.
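
    Gordon's hypothesis, that an element changed its sort key while still sitting inside the sorted set, can be reproduced in isolation. The following is a minimal sketch and not Heritrix code: DemoQueue, wakeTime, and StaleKeyDemo are hypothetical names chosen for illustration, and the comparator is an assumption about how snoozed queues might be ordered. It only shows the general java.util.TreeSet behaviour the report describes, where contains(first()) returns false after an in-place key change.

```java
import java.util.Comparator;
import java.util.TreeSet;

// Hypothetical stand-in for a snoozed host queue; NOT Heritrix's KeyedQueue.
class DemoQueue {
    final String host;
    long wakeTime; // sort key, deliberately mutable to show the hazard

    DemoQueue(String host, long wakeTime) {
        this.host = host;
        this.wakeTime = wakeTime;
    }
}

class StaleKeyDemo {
    // Assumed ordering: earliest wake time first, host name as tie-breaker.
    static final Comparator<DemoQueue> BY_WAKE_TIME = (a, b) -> {
        int c = Long.compare(a.wakeTime, b.wakeTime);
        return c != 0 ? c : a.host.compareTo(b.host);
    };

    static TreeSet<DemoQueue> newSnoozeSet() {
        TreeSet<DemoQueue> set = new TreeSet<>(BY_WAKE_TIME);
        set.add(new DemoQueue("a.example", 10));
        set.add(new DemoQueue("b.example", 20));
        set.add(new DemoQueue("c.example", 30));
        return set;
    }

    // Unsafe: the key changes while the element is still in the set, so the
    // element stays at its old tree position but now compares as the largest.
    static boolean mutateInPlace() {
        TreeSet<DemoQueue> set = newSnoozeSet();
        set.first().wakeTime = 40;
        return set.contains(set.first());
    }

    // Safe: remove the element, change the key, then re-insert it.
    static boolean removeMutateReinsert() {
        TreeSet<DemoQueue> set = newSnoozeSet();
        DemoQueue q = set.pollFirst();
        q.wakeTime = 40;
        set.add(q);
        return set.contains(set.first());
    }

    public static void main(String[] args) {
        System.out.println("in-place mutation, contains(first()): " + mutateInPlace());
        System.out.println("remove and re-add, contains(first()): " + removeMutateReinsert());
    }
}
```

    In this sketch the in-place variant reports false and the remove-and-re-add variant reports true. A set in the first state will also refuse remove(first()), which is consistent with the "first() item couldn't be remove()d! ... - false" message quoted in the report, though whether this is the actual root cause in Frontier is not established here.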

    JIRA | 10 years ago | Michael Stack
    org.apache.commons.httpclient.HttpRecoverableException: Error in parsing the status line from the response: unable to find line starting with "HTTP"
  2. 0

    Bug in Head Method (with authenticated server)

    hc-dev | 1 decade ago | Pill, Juergen
    org.apache.commons.httpclient.HttpRecoverableException: Error in parsing the status line from the response: unable to find line starting with "HTTP/"
  3. 0

    No Response from SunONE web server 6.1

    Oracle Community | 5 years ago | 709949
    org.apache.commons.httpclient.HttpRecoverableException: org.apache.commons.httpclient.HttpRecoverableException: Error in parsing the status line from the response: unable to find line starting with "HTTP"
  5. 0

    Re: Problem with proxy server and PUT method.

    hc-dev | 1 decade ago | Chris Smith
    org.apache.commons.httpclient.HttpRecoverableException: Error in parsing the status line from the response: unable to find line starting with "HTTP/"
  6. 0

    [RTFACT-1593] Deployment fails with hidden.org.apache.commons.httpclient.HttpRecoverableException - JFrog JIRA

    jfrog.com | 12 months ago
    org.apache.maven.lifecycle.LifecycleExecutionException: Error deploying artifact: PUT request for: gov/usda/plants-installer/1.0-SNAPSHOT/plants-installer-1.0-20091023.151044-8.tar.gz to plants-installer-1.0-SNAPSHOT.tar.gz failed

    Root Cause Analysis

    1. org.apache.commons.httpclient.HttpRecoverableException

      Error in parsing the status line from the response: unable to find line starting with "HTTP"

      at org.apache.commons.httpclient.HttpMethodBase.readResponse()
    2. HttpClient
      HttpMethodBase.execute
      1. org.apache.commons.httpclient.HttpMethodBase.readResponse(HttpMethodBase.java:1965)
      2. org.apache.commons.httpclient.HttpMethodBase.processRequest(HttpMethodBase.java:2659)
      3. org.apache.commons.httpclient.HttpMethodBase.execute(HttpMethodBase.java:1093)
      3 frames
    3. webarchive-commons
      PatchedHttpClient.executeMethod
      1. org.archive.httpclient.HttpRecorderGetMethod.execute(HttpRecorderGetMethod.java:157)
      2. org.archive.httpclient.PatchedHttpClient.executeMethod(PatchedHttpClient.java:294)
      2 frames
    4. HttpClient
      HttpClient.executeMethod
      1. org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:529)
      1 frame
    5. org.archive.crawler
      ToeThread.run
      1. org.archive.crawler.fetcher.FetchHTTP.innerProcess(FetchHTTP.java:231)
      2. org.archive.crawler.framework.Processor.process(Processor.java:106)
      3. org.archive.crawler.framework.ToeThread.processCrawlUri(ToeThread.java:255)
      4. org.archive.crawler.framework.ToeThread.run(ToeThread.java:138)
      4 frames
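
    The message at the top of this trace says that no line beginning with "HTTP" was found in what the server sent back. As a rough illustration of that condition, here is a minimal sketch under stated assumptions: StatusLineSketch, findStatusLine, and statusCode are hypothetical names, the logic is a simplification, and none of it is the Commons HttpClient implementation. It scans a raw response for a status line and fails when the server returned something else, such as a bare HTML body or an empty reply.

```java
// Hypothetical illustration only; NOT the Commons HttpClient implementation.
class StatusLineSketch {

    // Returns the first line starting with "HTTP", skipping any lines before it.
    // Throws an unchecked exception when no such line exists.
    static String findStatusLine(String rawResponse) {
        for (String line : rawResponse.split("\r?\n")) {
            if (line.startsWith("HTTP")) {
                return line;
            }
        }
        throw new IllegalStateException(
                "unable to find line starting with \"HTTP\"");
    }

    // Extracts the numeric status code from a line such as "HTTP/1.1 200 OK".
    static int statusCode(String statusLine) {
        String[] parts = statusLine.split(" ", 3);
        if (parts.length < 2) {
            throw new IllegalArgumentException("malformed status line: " + statusLine);
        }
        return Integer.parseInt(parts[1]);
    }

    public static void main(String[] args) {
        String good = "HTTP/1.1 200 OK\r\nContent-Length: 0\r\n\r\n";
        System.out.println(statusCode(findStatusLine(good)));

        String bad = "<html>not an HTTP response</html>\n";
        try {
            findStatusLine(bad);
        } catch (IllegalStateException e) {
            System.out.println("failed: " + e.getMessage());
        }
    }
}
```

    Servers that close the connection early, reply with non-HTTP content, or sit behind a misbehaving proxy can all produce input of the second kind, which is why the same message shows up in the unrelated reports listed above.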