Wednesday, 09. September 2020 03:57PM

How to use 'data-scripts' to set up a local Dask Scheduler and Dask Worker

Depending on the science background you're coming from, there is a good chance you'll need to process a large amount of data. data-scripts aims to turn AWS S3, DigitalOcean Spaces, Google Cloud Storage, and Microsoft Blob Storage into a massive data lake. All four major storage providers implement industry standards, which allows us to store and pull data from across the world, assuming we already have an index of the files being stored. data-scripts focuses on how to quickly feed data into your algorithms while staying mindful of how much that might cost.

data-scripts

data-scripts has scripts for managing massive amounts of data to help you or your organization focus more on science. data-scripts focuses on paid service providers so that you can estimate costs more accurately when writing grants. Further down the roadmap, we'll add access to free resources like GitHub and Bitbucket. Is there something else you'd like to see included? Please send us feedback to,

Dask

Dask is a flexible library for parallel computing in Python. In this tutorial we'll focus on the data_scripts helpers that start a dask-scheduler and dask-worker on the machine running Jupyter.

Tutorial

Let's go ahead and import the libraries we'll need for the project and start the dask-scheduler and dask-worker. data_scripts is smart enough to know if dask-scheduler or dask-worker is already running, so don't feel that you must run the following Jupyter cell only once. You may run it many times and it will always return the same dask-scheduler and dask-worker.

In [32]:
from distributed import Client, progress
from data_scripts import local

local.setup_dask()
Out[32]:
'Sch: 13939 - Work: 13960'
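
If you'd like to confirm independently that a scheduler is listening before you connect to it, a quick TCP check is enough. The sketch below is not part of data_scripts; is_port_open is a hypothetical helper written for this tutorial, and it assumes the scheduler uses Dask's default port 8786 (the same port the Client connects to in the next cell).

```python
import socket


def is_port_open(host: str, port: int, timeout: float = 0.5) -> bool:
    """Return True if something accepts TCP connections on host:port."""
    try:
        # create_connection raises OSError if nothing is listening
        # or the attempt times out.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


# Assumption: the local dask-scheduler listens on the default port 8786.
print(is_port_open("localhost", 8786))
```

This only tells you that some process is accepting connections on that port. It does not prove the process is a Dask scheduler, so treat it as a sanity check rather than a health check.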

With the Dask scheduler and worker running locally, let's run a simple script that adds 2 to every integer passed to the add() function.

In [33]:
def add(x):
    return x + 2

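# Connect to the local scheduler on Dask's default port
client = Client('localhost:8786')
# map() returns one future per input value
futures = client.map(add, [1, 2, 3])
[future.result() for future in futures]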
Out[33]:
[3, 4, 5]
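
Dask's futures interface extends the one in Python's standard concurrent.futures module, so the submit-then-collect pattern can be tried without any scheduler at all. The following is a minimal sketch using only the standard library, with a thread pool standing in for the Dask worker. It illustrates the pattern only and is not how data_scripts or Dask runs the work.

```python
from concurrent.futures import ThreadPoolExecutor


def add(x):
    return x + 2


# A local thread pool plays the role of the Dask worker here.
with ThreadPoolExecutor(max_workers=2) as pool:
    # submit() returns a future immediately; result() blocks until it is done.
    futures = [pool.submit(add, n) for n in [1, 2, 3]]
    results = [f.result() for f in futures]

print(results)  # [3, 4, 5]
```

The difference with Dask is where the work runs: the futures returned by client.map() point at tasks executing on the dask-worker process, which may live on another machine.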

With the results returned, let's clean up our machine to make sure we don't have any unnecessary programs running in the background.

In [ ]:
local.destroy_dask()