A Python solution to run a query in Google Cloud Dataproc using the API and return the query results

In some situations we need to run a query in Google Cloud Dataproc from a local machine and get the query results back. For example, we may need to send users messages based on query results from a Dataproc table, or update a MySQL table based on those results.

If our Hive table resides in an on-premise Hadoop cluster, we can run hive/spark-sql queries using subprocess in Python, and the query results are returned from the subprocess command as a string, which we can then process. The solution can be implemented as follows:
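A minimal sketch of this subprocess approach, assuming the `hive` command-line client is on the PATH and prints result rows as tab-separated text. The helper names `run_query` and `parse_rows` are illustrative and do not come from the original post.

```python
import subprocess


def run_query(query, cli=("hive", "-e")):
    """Run `query` through a command-line client and return its stdout as text.

    `cli` is the command prefix; ("spark-sql", "-e") is another option.
    """
    # check_output raises CalledProcessError if the client exits non-zero.
    output = subprocess.check_output(list(cli) + [query])
    return output.decode("utf-8")


def parse_rows(output):
    """Split tab-separated CLI output into rows (lists of strings)."""
    return [line.split("\t") for line in output.splitlines() if line.strip()]


# Example (hypothetical table name):
# rows = parse_rows(run_query("SELECT id, name FROM users LIMIT 10"))
```

Logging from the client normally goes to stderr, so only the result rows end up in the returned string.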

So this post implements a solution that runs a Dataproc query through the API and returns its output. There are at least three ways to do that.

(3) Use Cloud Storage as temporary space for the query results. By default, every Dataproc job writes its log and output to files in Cloud Storage. However, because Hive queries in Dataproc run through Beeline, the output file also contains other log information, so it is not easy to extract clean query results from it. Instead, we can use a similar idea: insert overwrite the query results into a designated path in Cloud Storage, and then read the contents back from Cloud Storage. The rest of this post explains this solution in detail. The assumption is that the query results are relatively small; otherwise we would not want to run a query and get the results back as a string, even in an on-premise Hadoop cluster.
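The idea in (3) can be sketched as follows, assuming a Dataproc client built with `googleapiclient.discovery.build("dataproc", "v1")` and a Hive version that supports `INSERT OVERWRITE DIRECTORY ... ROW FORMAT DELIMITED`. The function names, the polling interval, and the error handling are illustrative choices, not the original post's code.

```python
import time


def wrap_query(query, output_uri):
    """Prefix a SELECT so Hive writes its rows to `output_uri` as tab-separated text."""
    return (
        "INSERT OVERWRITE DIRECTORY '{uri}' "
        "ROW FORMAT DELIMITED FIELDS TERMINATED BY '\\t' "
        "{query}"
    ).format(uri=output_uri, query=query.strip().rstrip(";"))


def build_hive_job(cluster_name, query):
    """Build the request body for the Dataproc v1 jobs.submit method."""
    return {
        "job": {
            "placement": {"clusterName": cluster_name},
            "hiveJob": {"queryList": {"queries": [query]}},
        }
    }


def submit_and_wait(dataproc, project_id, region, body, poll_seconds=5):
    """Submit the job and poll until it reaches a terminal state.

    Assumption: `dataproc` comes from
    googleapiclient.discovery.build("dataproc", "v1").
    """
    jobs = dataproc.projects().regions().jobs()
    job = jobs.submit(projectId=project_id, region=region, body=body).execute()
    job_id = job["reference"]["jobId"]
    while True:
        job = jobs.get(projectId=project_id, region=region, jobId=job_id).execute()
        state = job["status"]["state"]
        if state == "DONE":
            return job
        if state in ("ERROR", "CANCELLED"):
            raise RuntimeError(
                "Dataproc job {} ended in state {}".format(job_id, state)
            )
        time.sleep(poll_seconds)
```

Writing to a fresh, job-specific path under the bucket keeps results from different runs from overwriting each other.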

The Python version I used here is Python 2.7, with the google.cloud, apiclient, and oauth2client libraries.

After the Dataproc job finishes, we use the Cloud Storage API to download the result blobs from Cloud Storage as a string.
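A minimal sketch of this download step, assuming the `google-cloud-storage` client library and a bucket object obtained from `storage.Client().bucket(...)`. The function name `read_results` and the prefix layout are illustrative. Recent library releases expose `Blob.download_as_bytes()`; older releases use `download_as_string()` instead.

```python
def read_results(bucket, prefix):
    """Concatenate every blob under `prefix` into one text string.

    Assumption: `bucket` is a google.cloud.storage Bucket, for example
    storage.Client().bucket("my-bucket") (hypothetical bucket name).
    """
    parts = []
    # Hive may split the output across several files under the directory.
    for blob in bucket.list_blobs(prefix=prefix):
        parts.append(blob.download_as_bytes().decode("utf-8"))
    return "".join(parts)


# Example (hypothetical names):
# from google.cloud import storage
# bucket = storage.Client().bucket("my-bucket")
# text = read_results(bucket, "tmp/query_results/")
# rows = [line.split("\t") for line in text.splitlines()]
```

The returned string has the same tab-separated shape as the output of an on-premise hive command, so downstream processing can stay the same.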
