org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 3.0 failed 1 times, most recent failure: Lost task 0.0 in stage 3.0 (TID 3, localhost): net.razorvine.pickle.PickleException: expected zero arguments for construction of ClassDict (for pyspark.sql.types._create_row)

JIRA | Russell Jurney | 8 months ago
  1.

    In [7]: on_time_rdd.saveToMongoDB('mongodb://localhost:27017/agile_data_science.on_time_performance') 16/03/28 18:04:06 INFO mapred.FileInputFormat: Total input paths to process : 1 16/03/28 18:04:06 INFO spark.SparkContext: Starting job: runJob at PythonRDD.scala:393 16/03/28 18:04:06 INFO scheduler.DAGScheduler: Got job 2 (runJob at PythonRDD.scala:393) with 1 output partitions 16/03/28 18:04:06 INFO scheduler.DAGScheduler: Final stage: ResultStage 2 (runJob at PythonRDD.scala:393) 16/03/28 18:04:06 INFO scheduler.DAGScheduler: Parents of final stage: List() 16/03/28 18:04:06 INFO scheduler.DAGScheduler: Missing parents: List() 16/03/28 18:04:06 INFO scheduler.DAGScheduler: Submitting ResultStage 2 (PythonRDD[13] at RDD at PythonRDD.scala:43), which has no missing parents 16/03/28 18:04:06 INFO storage.MemoryStore: Block broadcast_5 stored as values in memory (estimated size 19.3 KB, free 249.2 KB) 16/03/28 18:04:06 INFO storage.MemoryStore: Block broadcast_5_piece0 stored as bytes in memory (estimated size 9.7 KB, free 258.9 KB) 16/03/28 18:04:06 INFO storage.BlockManagerInfo: Added broadcast_5_piece0 in memory on localhost:59881 (size: 9.7 KB, free: 511.1 MB) 16/03/28 18:04:06 INFO spark.SparkContext: Created broadcast 5 from broadcast at DAGScheduler.scala:1006 16/03/28 18:04:06 INFO scheduler.DAGScheduler: Submitting 1 missing tasks from ResultStage 2 (PythonRDD[13] at RDD at PythonRDD.scala:43) 16/03/28 18:04:06 INFO scheduler.TaskSchedulerImpl: Adding task set 2.0 with 1 tasks 16/03/28 18:04:06 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 2.0 (TID 2, localhost, partition 0,PROCESS_LOCAL, 2666 bytes) 16/03/28 18:04:06 INFO executor.Executor: Running task 0.0 in stage 2.0 (TID 2) 16/03/28 18:04:06 INFO rdd.HadoopRDD: Input split: file:/Users/rjurney/Software/Agile_Data_Code_2/data/On_Time_On_Time_Performance_2015.csv.gz:0+312456777 16/03/28 18:04:06 INFO compress.CodecPool: Got brand-new decompressor [.gz] 16/03/28 18:04:07 INFO python.PythonRunner: Times: total = 1310, boot = 1249, init = 58, finish = 3 16/03/28 18:04:07 INFO executor.Executor: Finished task 0.0 in stage 2.0 (TID 2). 
4475 bytes result sent to driver 16/03/28 18:04:07 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 2.0 (TID 2) in 1345 ms on localhost (1/1) 16/03/28 18:04:07 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 2.0, whose tasks have all completed, from pool 16/03/28 18:04:07 INFO scheduler.DAGScheduler: ResultStage 2 (runJob at PythonRDD.scala:393) finished in 1.346 s 16/03/28 18:04:07 INFO scheduler.DAGScheduler: Job 2 finished: runJob at PythonRDD.scala:393, took 1.361003 s 16/03/28 18:04:07 INFO spark.SparkContext: Starting job: take at SerDeUtil.scala:231 16/03/28 18:04:07 INFO scheduler.DAGScheduler: Got job 3 (take at SerDeUtil.scala:231) with 1 output partitions 16/03/28 18:04:07 INFO scheduler.DAGScheduler: Final stage: ResultStage 3 (take at SerDeUtil.scala:231) 16/03/28 18:04:07 INFO scheduler.DAGScheduler: Parents of final stage: List() 16/03/28 18:04:07 INFO scheduler.DAGScheduler: Missing parents: List() 16/03/28 18:04:07 INFO scheduler.DAGScheduler: Submitting ResultStage 3 (MapPartitionsRDD[15] at mapPartitions at SerDeUtil.scala:146), which has no missing parents 16/03/28 18:04:07 INFO storage.MemoryStore: Block broadcast_6 stored as values in memory (estimated size 19.6 KB, free 278.4 KB) 16/03/28 18:04:07 INFO storage.MemoryStore: Block broadcast_6_piece0 stored as bytes in memory (estimated size 9.8 KB, free 288.2 KB) 16/03/28 18:04:07 INFO storage.BlockManagerInfo: Added broadcast_6_piece0 in memory on localhost:59881 (size: 9.8 KB, free: 511.1 MB) 16/03/28 18:04:07 INFO spark.SparkContext: Created broadcast 6 from broadcast at DAGScheduler.scala:1006 16/03/28 18:04:07 INFO scheduler.DAGScheduler: Submitting 1 missing tasks from ResultStage 3 (MapPartitionsRDD[15] at mapPartitions at SerDeUtil.scala:146) 16/03/28 18:04:07 INFO scheduler.TaskSchedulerImpl: Adding task set 3.0 with 1 tasks 16/03/28 18:04:07 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 3.0 (TID 3, localhost, partition 0,PROCESS_LOCAL, 2666 bytes) 16/03/28 18:04:07 INFO executor.Executor: Running task 0.0 in stage 3.0 (TID 3) 16/03/28 18:04:07 INFO rdd.HadoopRDD: Input split: file:/Users/rjurney/Software/Agile_Data_Code_2/data/On_Time_On_Time_Performance_2015.csv.gz:0+312456777 16/03/28 18:04:07 INFO compress.CodecPool: Got brand-new decompressor [.gz] 16/03/28 18:04:07 ERROR executor.Executor: Exception in task 0.0 in stage 3.0 (TID 3) net.razorvine.pickle.PickleException: expected zero arguments for construction of ClassDict (for pyspark.sql.types._create_row) at net.razorvine.pickle.objects.ClassDictConstructor.construct(ClassDictConstructor.java:23) at net.razorvine.pickle.Unpickler.load_reduce(Unpickler.java:707) at net.razorvine.pickle.Unpickler.dispatch(Unpickler.java:175) at net.razorvine.pickle.Unpickler.load(Unpickler.java:99) at net.razorvine.pickle.Unpickler.loads(Unpickler.java:112) at org.apache.spark.api.python.SerDeUtil$$anonfun$pythonToJava$1$$anonfun$apply$1.apply(SerDeUtil.scala:150) at org.apache.spark.api.python.SerDeUtil$$anonfun$pythonToJava$1$$anonfun$apply$1.apply(SerDeUtil.scala:149) at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371) at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:308) at scala.collection.Iterator$class.foreach(Iterator.scala:727) at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48) at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103) at 
scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47) at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273) at scala.collection.AbstractIterator.to(Iterator.scala:1157) at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265) at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157) at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252) at scala.collection.AbstractIterator.toArray(Iterator.scala:1157) at org.apache.spark.rdd.RDD$$anonfun$take$1$$anonfun$28.apply(RDD.scala:1328) at org.apache.spark.rdd.RDD$$anonfun$take$1$$anonfun$28.apply(RDD.scala:1328) at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1858) at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1858) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66) at org.apache.spark.scheduler.Task.run(Task.scala:89) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) 16/03/28 18:04:07 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 3.0 (TID 3, localhost): net.razorvine.pickle.PickleException: expected zero arguments for construction of ClassDict (for pyspark.sql.types._create_row) at net.razorvine.pickle.objects.ClassDictConstructor.construct(ClassDictConstructor.java:23) at net.razorvine.pickle.Unpickler.load_reduce(Unpickler.java:707) at net.razorvine.pickle.Unpickler.dispatch(Unpickler.java:175) at net.razorvine.pickle.Unpickler.load(Unpickler.java:99) at net.razorvine.pickle.Unpickler.loads(Unpickler.java:112) at org.apache.spark.api.python.SerDeUtil$$anonfun$pythonToJava$1$$anonfun$apply$1.apply(SerDeUtil.scala:150) at org.apache.spark.api.python.SerDeUtil$$anonfun$pythonToJava$1$$anonfun$apply$1.apply(SerDeUtil.scala:149) at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371) at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:308) at scala.collection.Iterator$class.foreach(Iterator.scala:727) at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48) at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103) at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47) at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273) at scala.collection.AbstractIterator.to(Iterator.scala:1157) at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265) at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157) at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252) at scala.collection.AbstractIterator.toArray(Iterator.scala:1157) at org.apache.spark.rdd.RDD$$anonfun$take$1$$anonfun$28.apply(RDD.scala:1328) at org.apache.spark.rdd.RDD$$anonfun$take$1$$anonfun$28.apply(RDD.scala:1328) at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1858) at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1858) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66) at org.apache.spark.scheduler.Task.run(Task.scala:89) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) 16/03/28 18:04:07 ERROR scheduler.TaskSetManager: Task 0 in stage 3.0 failed 1 times; aborting job 16/03/28 18:04:07 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 3.0, whose tasks have all completed, from pool 16/03/28 18:04:07 INFO scheduler.TaskSchedulerImpl: Cancelling stage 3 16/03/28 18:04:07 INFO scheduler.DAGScheduler: ResultStage 3 (take at SerDeUtil.scala:231) failed in 0.117 s 16/03/28 18:04:07 INFO scheduler.DAGScheduler: Job 3 failed: take at SerDeUtil.scala:231, took 0.134593 s --------------------------------------------------------------------------- Py4JJavaError Traceback (most recent call last) <ipython-input-7-d1f984f17e27> in <module>() ----> 1 on_time_rdd.saveToMongoDB('mongodb://localhost:27017/agile_data_science.on_time_performance') /Users/rjurney/Software/Agile_Data_Code_2/lib/pymongo_spark.pyc in saveToMongoDB(self, connection_string, config) 104 keyConverter='com.mongodb.spark.pickle.NoopConverter', 105 valueConverter='com.mongodb.spark.pickle.NoopConverter', --> 106 conf=conf) 107 108 /Users/rjurney/Software/Agile_Data_Code_2/spark/python/pyspark/rdd.pyc in saveAsNewAPIHadoopFile(self, path, outputFormatClass, keyClass, valueClass, keyConverter, valueConverter, conf) 1372 outputFormatClass, 1373 keyClass, valueClass, -> 1374 keyConverter, valueConverter, jconf) 1375 1376 def saveAsHadoopDataset(self, conf, keyConverter=None, valueConverter=None): /Users/rjurney/Software/Agile_Data_Code_2/spark/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py in __call__(self, *args) 811 answer = self.gateway_client.send_command(command) 812 return_value = get_return_value( --> 813 answer, self.gateway_client, self.target_id, self.name) 814 815 for temp_arg in temp_args: /Users/rjurney/Software/Agile_Data_Code_2/spark/python/pyspark/sql/utils.pyc in deco(*a, **kw) 43 def deco(*a, **kw): 44 try: ---> 45 return f(*a, **kw) 46 except py4j.protocol.Py4JJavaError as e: 47 s = e.java_exception.toString() /Users/rjurney/Software/Agile_Data_Code_2/spark/python/lib/py4j-0.9-src.zip/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name) 306 raise Py4JJavaError( 307 "An error occurred while calling {0}{1}{2}.\n". --> 308 format(target_id, ".", name), value) 309 else: 310 raise Py4JError( Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.saveAsNewAPIHadoopFile. 
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 3.0 failed 1 times, most recent failure: Lost task 0.0 in stage 3.0 (TID 3, localhost): net.razorvine.pickle.PickleException: expected zero arguments for construction of ClassDict (for pyspark.sql.types._create_row) at net.razorvine.pickle.objects.ClassDictConstructor.construct(ClassDictConstructor.java:23) at net.razorvine.pickle.Unpickler.load_reduce(Unpickler.java:707) at net.razorvine.pickle.Unpickler.dispatch(Unpickler.java:175) at net.razorvine.pickle.Unpickler.load(Unpickler.java:99) at net.razorvine.pickle.Unpickler.loads(Unpickler.java:112) at org.apache.spark.api.python.SerDeUtil$$anonfun$pythonToJava$1$$anonfun$apply$1.apply(SerDeUtil.scala:150) at org.apache.spark.api.python.SerDeUtil$$anonfun$pythonToJava$1$$anonfun$apply$1.apply(SerDeUtil.scala:149) at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371) at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:308) at scala.collection.Iterator$class.foreach(Iterator.scala:727) at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48) at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103) at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47) at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273) at scala.collection.AbstractIterator.to(Iterator.scala:1157) at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265) at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157) at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252) at scala.collection.AbstractIterator.toArray(Iterator.scala:1157) at org.apache.spark.rdd.RDD$$anonfun$take$1$$anonfun$28.apply(RDD.scala:1328) at org.apache.spark.rdd.RDD$$anonfun$take$1$$anonfun$28.apply(RDD.scala:1328) at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1858) at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1858) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66) at org.apache.spark.scheduler.Task.run(Task.scala:89) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) Driver stacktrace: at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1431) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1419) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1418) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1418) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:799) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:799) at scala.Option.foreach(Option.scala:236) at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:799) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1640) at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1599) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1588) at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48) at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:620) at org.apache.spark.SparkContext.runJob(SparkContext.scala:1832) at org.apache.spark.SparkContext.runJob(SparkContext.scala:1845) at org.apache.spark.SparkContext.runJob(SparkContext.scala:1858) at org.apache.spark.rdd.RDD$$anonfun$take$1.apply(RDD.scala:1328) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111) at org.apache.spark.rdd.RDD.withScope(RDD.scala:316) at org.apache.spark.rdd.RDD.take(RDD.scala:1302) at org.apache.spark.api.python.SerDeUtil$.pythonToPairRDD(SerDeUtil.scala:231) at org.apache.spark.api.python.PythonRDD$.saveAsNewAPIHadoopFile(PythonRDD.scala:775) at org.apache.spark.api.python.PythonRDD.saveAsNewAPIHadoopFile(PythonRDD.scala) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:497) at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231) at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:381) at py4j.Gateway.invoke(Gateway.java:259) at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133) at py4j.commands.CallCommand.execute(CallCommand.java:79) at py4j.GatewayConnection.run(GatewayConnection.java:209) at java.lang.Thread.run(Thread.java:745) Caused by: net.razorvine.pickle.PickleException: expected zero arguments for construction of ClassDict (for pyspark.sql.types._create_row) at net.razorvine.pickle.objects.ClassDictConstructor.construct(ClassDictConstructor.java:23) at net.razorvine.pickle.Unpickler.load_reduce(Unpickler.java:707) at net.razorvine.pickle.Unpickler.dispatch(Unpickler.java:175) at net.razorvine.pickle.Unpickler.load(Unpickler.java:99) at net.razorvine.pickle.Unpickler.loads(Unpickler.java:112) at org.apache.spark.api.python.SerDeUtil$$anonfun$pythonToJava$1$$anonfun$apply$1.apply(SerDeUtil.scala:150) at org.apache.spark.api.python.SerDeUtil$$anonfun$pythonToJava$1$$anonfun$apply$1.apply(SerDeUtil.scala:149) at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371) at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:308) at scala.collection.Iterator$class.foreach(Iterator.scala:727) at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48) at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103) at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47) at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273) at scala.collection.AbstractIterator.to(Iterator.scala:1157) at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265) at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157) at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252) at scala.collection.AbstractIterator.toArray(Iterator.scala:1157) at org.apache.spark.rdd.RDD$$anonfun$take$1$$anonfun$28.apply(RDD.scala:1328) at org.apache.spark.rdd.RDD$$anonfun$take$1$$anonfun$28.apply(RDD.scala:1328) at 
org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1858) at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1858) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66) at org.apache.spark.scheduler.Task.run(Task.scala:89) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) ... 1 more

    JIRA | 8 months ago | Russell Jurney
    org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 3.0 failed 1 times, most recent failure: Lost task 0.0 in stage 3.0 (TID 3, localhost): net.razorvine.pickle.PickleException: expected zero arguments for construction of ClassDict (for pyspark.sql.types._create_row)
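    A fix commonly suggested for this class of failure (it is not part of the report above) is to convert each pyspark.sql.Row into a plain Python dict before the RDD reaches pymongo_spark, so the JVM-side unpickler never has to reconstruct pyspark.sql.types._create_row. A minimal sketch follows; the tiny DataFrame, app name and column names are illustrative stand-ins for the on_time_performance data in the log.

    # Row -> dict workaround sketch; data and names below are illustrative only
    from pyspark import SparkContext
    from pyspark.sql import SQLContext
    import pymongo_spark

    pymongo_spark.activate()                        # adds saveToMongoDB() to RDDs
    sc = SparkContext(appName="row_to_dict_sketch")
    sqlContext = SQLContext(sc)

    # Stand-in for the on-time performance DataFrame built in the report
    on_time_df = sqlContext.createDataFrame(
        [("AA", 1, -5.0), ("DL", 2, 12.0)],
        ["Carrier", "FlightNum", "ArrDelay"])

    # asDict() strips the _create_row constructor out of the pickle stream
    on_time_df.rdd.map(lambda row: row.asDict()).saveToMongoDB(
        'mongodb://localhost:27017/agile_data_science.on_time_performance')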
  2.

    pyspark saveAsSequenceFile with pyspark.ml.linalg.Vectors

    Stack Overflow | 4 months ago | Θεόφιλος Μουρατίδης
    org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 2.0 failed 1 times, most recent failure: Lost task 0.0 in stage 2.0 (TID 5, localhost): net.razorvine.pickle.PickleException: expected zero arguments for construction of ClassDict (for pyspark.ml.linalg.SparseVector)
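    One workaround often suggested for this family of errors (not taken from the question itself) is to keep only plain Python types in the RDD handed to saveAsSequenceFile, since pyspark.ml.linalg vectors have no constructor registered with the JVM-side unpickler. The sketch below JSON-encodes each vector as a string value; the sample vectors, app name and output path are illustrative.

    # Plain-types workaround sketch for saveAsSequenceFile; names and path are illustrative
    import json
    from pyspark import SparkContext
    from pyspark.ml.linalg import Vectors

    sc = SparkContext(appName="vector_seqfile_sketch")

    pairs = sc.parallelize([
        (0, Vectors.sparse(4, [0, 2], [1.0, 3.0])),
        (1, Vectors.dense([0.0, 1.0, 2.0, 3.0])),
    ])

    # (int, str) pairs map cleanly onto Hadoop writables in the sequence file
    as_text = pairs.mapValues(lambda v: json.dumps(v.toArray().tolist()))
    as_text.saveAsSequenceFile("/tmp/vectors_seqfile")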

  3.

    "expected zero arguments for construction of ClassDict" error when training Naive Bayes in Pyspark

    Stack Overflow | 4 weeks ago | jartymcfly
    org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1683.0 failed 4 times, most recent failure: Lost task 0.3 in stage 1683.0 (TID 30310, server123.es): net.razorvine.pickle.PickleException: expected zero arguments for construction of ClassDict (for numpy.dtype)
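    A remedy frequently suggested for this error (not taken from the question itself) is to cast labels and features to built-in Python floats before building the training records, since the numpy.dtype ClassDict error usually points at numpy scalars hiding inside the data shipped to the JVM. The sketch below assumes an MLlib-style NaiveBayes.train() workflow with illustrative toy data.

    # Cast-to-float workaround sketch; toy data and app name are illustrative only
    import numpy as np
    from pyspark import SparkContext
    from pyspark.mllib.classification import NaiveBayes
    from pyspark.mllib.regression import LabeledPoint

    sc = SparkContext(appName="naive_bayes_sketch")

    raw = sc.parallelize([
        (np.float64(0.0), np.array([1.0, 0.0, 2.0])),
        (np.float64(1.0), np.array([0.0, 3.0, 1.0])),
    ])

    # float(...) and the list comprehension replace numpy scalars with plain Python floats
    training = raw.map(lambda p: LabeledPoint(float(p[0]), [float(x) for x in p[1]]))
    model = NaiveBayes.train(training)
    print(model.predict([0.0, 3.0, 1.0]))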
  4.

    I am unable to create a DataFrame with PySpark if any of the {{datetime}} objects that pass through the conversion process have a {{tzinfo}} property set. This works fine:

    {code}
    In [9]: sc.parallelize([(datetime.datetime(2014, 7, 8, 11, 10),)]).toDF().collect()
    Out[9]: [Row(_1=datetime.datetime(2014, 7, 8, 11, 10))]
    {code}

    As expected, the tuple's schema is inferred as having one anonymous column with a datetime field, and the datetime round-trips through the Java-side Python deserialization and back into Python land upon {{collect}}. This, however:

    {code}
    In [5]: from dateutil.tz import tzutc
    In [10]: sc.parallelize([(datetime.datetime(2014, 7, 8, 11, 10, tzinfo=tzutc()),)]).toDF().collect()
    {code}

    explodes with:

    {code}
    Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
    : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 12.0 failed 1 times, most recent failure: Lost task 0.0 in stage 12.0 (TID 12, localhost): net.razorvine.pickle.PickleException: invalid pickle data for datetime; expected 1 or 7 args, got 2
        at net.razorvine.pickle.objects.DateTimeConstructor.createDateTime(DateTimeConstructor.java:69)
        at net.razorvine.pickle.objects.DateTimeConstructor.construct(DateTimeConstructor.java:32)
        at net.razorvine.pickle.Unpickler.load_reduce(Unpickler.java:617)
        at net.razorvine.pickle.Unpickler.dispatch(Unpickler.java:170)
        at net.razorvine.pickle.Unpickler.load(Unpickler.java:84)
        at net.razorvine.pickle.Unpickler.loads(Unpickler.java:97)
        at org.apache.spark.api.python.SerDeUtil$$anonfun$pythonToJava$1$$anonfun$apply$1.apply(SerDeUtil.scala:154)
        at org.apache.spark.api.python.SerDeUtil$$anonfun$pythonToJava$1$$anonfun$apply$1.apply(SerDeUtil.scala:153)
        at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
        at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
        at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
        at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
        at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
        at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
        at org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.hasNext(SerDeUtil.scala:119)
        at scala.collection.Iterator$class.foreach(Iterator.scala:727)
        at org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.foreach(SerDeUtil.scala:114)
        at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
        at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
        at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
        at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
        at org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.to(SerDeUtil.scala:114)
        at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
        at org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.toBuffer(SerDeUtil.scala:114)
        at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
        at org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.toArray(SerDeUtil.scala:114)
        at org.apache.spark.rdd.RDD$$anonfun$17.apply(RDD.scala:813)
        at org.apache.spark.rdd.RDD$$anonfun$17.apply(RDD.scala:813)
        at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1520)
        at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1520)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
        at org.apache.spark.scheduler.Task.run(Task.scala:64)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:744)
    Driver stacktrace:
        at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1211)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1200)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1199)
        at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
        at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
        at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1199)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693)
        at scala.Option.foreach(Option.scala:236)
        at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:693)
        at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1401)
        at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1362)
        at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
    {code}

    By the looks of the error, it would appear as though the Java depickler isn't expecting the pickle stream to provide that extra timezone constructor argument. Here's the disassembled pickle stream for a timezone-less datetime:

    {code}
    >>> object = datetime.datetime(2014, 7, 8, 11, 10)
    >>> stream = pickle.dumps(object)
    >>> pickletools.dis(stream)
     0: c GLOBAL 'datetime datetime'
    19: p PUT 0
    22: ( MARK
    23: S STRING '\x07\xde\x07\x08\x0b\n\x00\x00\x00\x00'
    65: p PUT 1
    68: t TUPLE (MARK at 22)
    69: p PUT 2
    72: R REDUCE
    73: p PUT 3
    76: . STOP
    highest protocol among opcodes = 0
    {code}

    and then for one with a timezone:

    {code}
    >>> object = datetime.datetime(2014, 7, 8, 11, 10, tzinfo=tzutc())
    >>> stream = pickle.dumps(object)
    >>> pickletools.dis(stream)
      0: c GLOBAL 'datetime datetime'
     19: p PUT 0
     22: ( MARK
     23: S STRING '\x07\xde\x07\x08\x0b\n\x00\x00\x00\x00'
     65: p PUT 1
     68: c GLOBAL 'copy_reg _reconstructor'
     93: p PUT 2
     96: ( MARK
     97: c GLOBAL 'dateutil.tz tzutc'
    116: p PUT 3
    119: c GLOBAL 'datetime tzinfo'
    136: p PUT 4
    139: g GET 4
    142: ( MARK
    143: t TUPLE (MARK at 142)
    144: R REDUCE
    145: p PUT 5
    148: t TUPLE (MARK at 96)
    149: p PUT 6
    152: R REDUCE
    153: p PUT 7
    156: t TUPLE (MARK at 22)
    157: p PUT 8
    160: R REDUCE
    161: p PUT 9
    164: . STOP
    highest protocol among opcodes = 0
    {code}

    I would bet that the Pyrolite library is missing support for that nested object as a second tuple member in the reconstruction of the datetime object. Has anyone hit this before? Any more information I can provide?

    Apache's JIRA Issue Tracker | 2 years ago | Harry Brundage
    org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 12.0 failed 1 times, most recent failure: Lost task 0.0 in stage 12.0 (TID 12, localhost): net.razorvine.pickle.PickleException: invalid pickle data for datetime; expected 1 or 7 args, got 2
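    A workaround commonly applied to this scenario (the report above only diagnoses the bug) is to normalise timezone-aware datetimes to naive UTC before they reach createDataFrame()/toDF(), so the pickle stream only carries the 1- or 7-argument datetime form that Pyrolite expects. A minimal sketch, with illustrative names:

    # Convert to UTC, then drop tzinfo so no nested tz object ends up in the pickle stream
    import datetime
    from dateutil.tz import tzutc
    from pyspark import SparkContext
    from pyspark.sql import SQLContext

    sc = SparkContext(appName="naive_utc_sketch")
    sqlContext = SQLContext(sc)

    def to_naive_utc(dt):
        return dt.astimezone(tzutc()).replace(tzinfo=None)

    aware = datetime.datetime(2014, 7, 8, 11, 10, tzinfo=tzutc())
    df = sqlContext.createDataFrame(sc.parallelize([(to_naive_utc(aware),)]), ["ts"])
    print(df.collect())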


    Root Cause Analysis

    1. org.apache.spark.SparkException

      Job aborted due to stage failure: Task 0 in stage 3.0 failed 1 times, most recent failure: Lost task 0.0 in stage 3.0 (TID 3, localhost): net.razorvine.pickle.PickleException: expected zero arguments for construction of ClassDict (for pyspark.sql.types._create_row)

      at net.razorvine.pickle.objects.ClassDictConstructor.construct()
    2. pyrolite
      Unpickler.loads
      1. net.razorvine.pickle.objects.ClassDictConstructor.construct(ClassDictConstructor.java:23)
      2. net.razorvine.pickle.Unpickler.load_reduce(Unpickler.java:707)
      3. net.razorvine.pickle.Unpickler.dispatch(Unpickler.java:175)
      4. net.razorvine.pickle.Unpickler.load(Unpickler.java:99)
      5. net.razorvine.pickle.Unpickler.loads(Unpickler.java:112)
      5 frames
    3. Spark
      SerDeUtil$$anonfun$pythonToJava$1$$anonfun$apply$1.apply
      1. org.apache.spark.api.python.SerDeUtil$$anonfun$pythonToJava$1$$anonfun$apply$1.apply(SerDeUtil.scala:150)
      2. org.apache.spark.api.python.SerDeUtil$$anonfun$pythonToJava$1$$anonfun$apply$1.apply(SerDeUtil.scala:149)
      2 frames
    4. Scala
      AbstractIterator.toArray
      1. scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
      2. scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:308)
      3. scala.collection.Iterator$class.foreach(Iterator.scala:727)
      4. scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
      5. scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
      6. scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
      7. scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
      8. scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
      9. scala.collection.AbstractIterator.to(Iterator.scala:1157)
      10. scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
      11. scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
      12. scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
      13. scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
      13 frames
    5. Spark
      Executor$TaskRunner.run
      1. org.apache.spark.rdd.RDD$$anonfun$take$1$$anonfun$28.apply(RDD.scala:1328)
      2. org.apache.spark.rdd.RDD$$anonfun$take$1$$anonfun$28.apply(RDD.scala:1328)
      3. org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1858)
      4. org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1858)
      5. org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
      6. org.apache.spark.scheduler.Task.run(Task.scala:89)
      7. org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
      7 frames
    6. Java RT
      Thread.run
      1. java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
      2. java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
      3. java.lang.Thread.run(Thread.java:745)
      3 frames