mapred.InputPathProcessor: Executor shut down

On Friday, May 13, 2016 at 10:07:38 PM UTC-7, Xiao Zhou wrote:
I am using HDFS as output. Thanks.

On Friday, May 13, 2016 at 5:48:51 PM UTC-7, Félix GV wrote:
Ok cool. Did you end up needing the other change I made for the mv operation? Or are you using just HDFS for the output of the build now?
-F

On Fri, May 13, 2016 at 16:23, Xiao Zhou <xiao...@gmail.com> wrote:
Yes, I have opened a pull request: https://github.com/voldemort/voldemort/pull/408. There are no more issues for the S3 migration right now. Thanks for helping!

On Friday, May 13, 2016 at 3:44:33 PM UTC-7, Félix GV wrote:
All right, great! Do you think you could squash those three commits and create a formal Pull Request for your change? Also, are you in a state where things work end-to-end for you with S3, or are there any other issues?
-F

On Fri, May 13, 2016 at 3:09 PM, Xiao Zhou <xiaoz...@gmail.com> wrote:
Thanks. Here is the pull request for my change for the input: https://github.com/voldemort/voldemort/compare/master...xiaozh:088c47d84d7f11ea26ba9f223ca2cd22233c82cb?expand=1

On Friday, May 13, 2016 at 1:20:44 PM UTC-7, Félix GV wrote:
Hi Xiao, these files are only kept on HDFS for the duration of the fetch. After the Voldemort servers are finished fetching the data, the files are deleted from HDFS. I believe it is probably transient enough for you to leave them on HDFS. Once you're done, you should definitely write a blog post about everything you needed to do to get Voldemort / BnP up and running on Amazon. Very interesting stuff!
-F

On Fri, May 13, 2016 at 1:06 PM, Xiao Zhou <xiaoz...@gmail.com> wrote:
Amazon recommends using HDFS as temporary file storage and S3 as final file storage, since HDFS can go away if the cluster is powered down. We could write the output to HDFS and then distcp the files to S3, which is not optimal but acceptable. I was trying to find out if there is an easy way to write to S3 directly so we could remove that extra step. It seems it would be too much trouble, so we will just take the extra copy step. It can be done outside of the normal workflow, so it should not have too much impact. Thanks for helping out.

On Friday, May 13, 2016 at 12:53:56 PM UTC-7, Félix GV wrote:
Hi Xiao, that sounds like a correct assessment, although I'm not sure the right solution is to build the files in /tmp/ on HDFS and then copy those files over to S3 afterwards. Why not build the files on S3 and copy them to some other place on S3, or build them on HDFS and move them on HDFS? The API I used does allow two filesystems to be specified, but I think it may be complicated to get hold of both of these FS from within the HadoopStoreWriter class (not saying it's impossible, just that it may be troublesome). Since you are apparently using HDFS anyway, would there be any downside to setting your output path to be on HDFS as well? In that case, you could use the regular mv operation, which is already well supported. Thanks for your debugging effort. Let me know what you think.
-F

On Fri, May 13, 2016 at 12:13 PM, Xiao Zhou <xiaoz...@gmail.com> wrote:
Got this error on the else branch:

    else {
        logger.info("Moving " + src + " to " + dest);
        fs.rename(src, dest);

I think the issue is that fs is created from the temp directory, which is on HDFS (-tmp /tmp/rosb \); we probably don't want the temp to be on S3. The condition

    if (fs.getScheme().toLowerCase().contains("s3")) {

did not get invoked. We need to do a cross-filesystem copy from HDFS to S3 in that code, so the two FS arguments in FileUtil.copy probably need to be different?

Google Groups | Xiao Zhou | 7 months ago
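For reference, here is a minimal sketch of the cross-filesystem move discussed in the thread: each FileSystem is resolved from its own path rather than from the temp directory, and FileUtil.copy is used when the two differ. The class and method names below are illustrative only, not the actual Voldemort/BnP code.

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.FileUtil;
    import org.apache.hadoop.fs.Path;

    public class CrossFsMove {

        // Moves src to dest; copies across filesystems (e.g. HDFS -> S3)
        // when the two paths do not resolve to the same FileSystem.
        public static void moveOrCopy(Path src, Path dest, Configuration conf) throws IOException {
            // Resolve each FileSystem from its own path, instead of reusing the
            // FileSystem of the temp directory, so an s3/s3n destination is detected.
            FileSystem srcFs = src.getFileSystem(conf);
            FileSystem destFs = dest.getFileSystem(conf);

            boolean done;
            if (srcFs.getUri().equals(destFs.getUri())) {
                // Same filesystem: a plain rename is enough.
                done = srcFs.rename(src, dest);
            } else {
                // rename() cannot move data onto another filesystem, so copy instead.
                // FileUtil.copy takes both FileSystems explicitly; "true" deletes the
                // source after a successful copy, approximating a move.
                done = FileUtil.copy(srcFs, src, destFs, dest, true, conf);
            }
            if (!done) {
                throw new IOException("Failed to move " + src + " to " + dest);
            }
        }
    }

In the discussion above, the simpler route was to keep the build output on HDFS and rely on the existing rename path, with an out-of-band distcp to S3 when needed.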

    does the voldemort RO store builder support getting input from s3n?

    Google Groups | 7 months ago | Xiao Zhou

    Root Cause Analysis

    1. mapred.InputPathProcessor

      Executor shut down

      at org.apache.hadoop.hdfs.DistributedFileSystem.rename()
    2. Apache Hadoop HDFS
      DistributedFileSystem.rename
      1. org.apache.hadoop.hdfs.DistributedFileSystem.rename(DistributedFileSystem.java:575)
      1 frame
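The frame above is consistent with the diagnosis in the thread: the FileSystem was obtained from the HDFS temp directory, so DistributedFileSystem.rename() was asked to move data onto S3, which it cannot do, and a scheme check on that same FileSystem never sees "s3". A rough illustration follows; the bucket name and paths are made up, and it assumes the s3n connector and AWS credentials are configured.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class SchemeCheck {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Temp dir as passed via -tmp /tmp/rosb; resolves against fs.defaultFS,
            // which on the cluster is HDFS (DistributedFileSystem).
            Path tmp = new Path("/tmp/rosb/store");
            // Hypothetical final output location on S3 (NativeS3FileSystem).
            Path out = new Path("s3n://example-bucket/voldemort/");

            FileSystem tmpFs = tmp.getFileSystem(conf);
            FileSystem outFs = out.getFileSystem(conf);

            // The scheme of the temp-dir FileSystem is "hdfs", so a check like
            // fs.getScheme().contains("s3") on it can never take the S3 branch.
            System.out.println(tmpFs.getUri().getScheme()); // hdfs
            System.out.println(outFs.getUri().getScheme()); // s3n
        }
    }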