
Working with Anaconda: finding correlations between cryptocurrency exchange rates



The purpose of this article is to provide an easy introduction to data analysis using Anaconda. We will go through writing a simple Python script to extract, analyze and visualize data on various cryptocurrencies.

Step 1 - Setting up the work environment.


The only skill you need is a basic understanding of Python.

Step 1.1 - Install Anaconda
The Anaconda distribution can be downloaded from the official website.
Installation uses a standard step-by-step installer.

Step 1.2 - Setting up the project work environment

Once Anaconda is installed, create and activate a new environment to hold the project's dependencies.

Why use an environment? If you plan to develop several Python projects on your computer, it helps to keep their dependencies (libraries and packages) separate to avoid conflicts. Anaconda creates a dedicated environment directory for each project's dependencies, so everything stays organized and isolated.
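To confirm from inside Python which environment is actually in use, a minimal sketch might look like the following. It assumes that conda sets the CONDA_DEFAULT_ENV variable when an environment is activated; the function name describe_environment is only an illustrative choice.

```python
import os
import sys


def describe_environment():
    """Return the interpreter prefix and the active conda environment name, if any."""
    return {
        # Directory that holds this interpreter and its installed packages.
        "prefix": sys.prefix,
        # Assumed to be set by conda on activation; None otherwise.
        "conda_env": os.environ.get("CONDA_DEFAULT_ENV"),
    }


info = describe_environment()
print("Interpreter prefix:", info["prefix"])
print("Active conda environment:", info["conda_env"])
```

If the environment from this article is active, the second line should show cryptocurrency-analysis.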

This can be done either from the command line:

conda create --name cryptocurrency-analysis python=3.6 

 source activate cryptocurrency-analysis 

(Linux / macOS)

or

 activate cryptocurrency-analysis 

(Windows)

or through Anaconda Navigator. In that case, the environment is activated automatically.

Then install the required dependencies: NumPy, Pandas, nb_conda, Jupyter, Plotly and Quandl.

 conda install numpy pandas nb_conda jupyter plotly quandl 

or through Anaconda Navigator, installing each package in turn.



This may take a few minutes.
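Once installation finishes, a quick check like the one below can confirm that the libraries are visible to the interpreter. This is only a sketch: the helper name find_missing is hypothetical, and nb_conda and jupyter are left out because the analysis script does not import them directly.

```python
import importlib.util

# Import names of the libraries the analysis script uses.
REQUIRED = ["numpy", "pandas", "plotly", "quandl"]


def find_missing(packages):
    """Return the packages that cannot be found in the current environment."""
    return [name for name in packages if importlib.util.find_spec(name) is None]


missing = find_missing(REQUIRED)
print("Missing packages:", missing if missing else "none")
```

If anything is reported missing, the most likely cause is that the packages were installed into a different environment than the one currently active.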

Step 1.3 - Launch Jupyter Notebook

Again there are two options: from the command line, run jupyter notebook and open http://localhost:8888/ in your browser, or launch Jupyter Notebook from Anaconda Navigator.



Step 1.4 - Import Dependencies

After you open an empty Jupyter Notebook, the first thing to do is import the required dependencies.

    import os
    import numpy as np
    import pandas as pd
    import pickle
    import quandl
    from datetime import datetime

Then import Plotly and enable its offline mode.

    import plotly.offline as py
    import plotly.graph_objs as go
    import plotly.figure_factory as ff
    py.init_notebook_mode(connected=True)

Step 2 - Retrieving Bitcoin Pricing Data


Now that everything is set up, we are ready to begin extracting data for analysis. To begin with, we’ll get pricing data using the free Quandl API.

Step 2.1 - Define the Quandl function
First, we define a function for downloading and caching datasets from Quandl.

    def get_quandl_data(quandl_id):
        '''Download and cache Quandl dataseries'''
        cache_path = '{}.pkl'.format(quandl_id).replace('/', '-')
        try:
            # Reuse the pickled copy if this dataset was downloaded before.
            with open(cache_path, 'rb') as f:
                df = pickle.load(f)
            print('Loaded {} from cache'.format(quandl_id))
        except (OSError, IOError) as e:
            # No cache file yet: download the data and save it for next time.
            print('Downloading {} from Quandl'.format(quandl_id))
            df = quandl.get(quandl_id, returns="pandas")
            df.to_pickle(cache_path)
            print('Cached {} at {}'.format(quandl_id, cache_path))
        return df

We use pickle to serialize the downloaded data and save it to a file, so the script does not have to download the same data again each time it runs.

The function returns the data as a pandas DataFrame.
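The same cache-or-download pattern can be tried without any network access. The sketch below is an illustration rather than part of the original script: load_or_compute and slow_download are hypothetical names, and a temporary directory stands in for the working directory.

```python
import os
import pickle
import tempfile


def load_or_compute(cache_path, compute):
    """Return the pickled object at cache_path, computing and caching it if absent."""
    try:
        with open(cache_path, "rb") as f:
            return pickle.load(f)
    except (OSError, IOError):
        result = compute()
        with open(cache_path, "wb") as f:
            pickle.dump(result, f)
        return result


calls = []


def slow_download():
    """Stand-in for a network request; records each call so we can count them."""
    calls.append(1)
    return {"price": 841.83}


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.pkl")
    first = load_or_compute(path, slow_download)   # cache miss: runs slow_download
    second = load_or_compute(path, slow_download)  # cache hit: read from disk

print(first == second, len(calls))  # True 1
```

The second call never reaches slow_download, which is the behavior we want from get_quandl_data on repeated runs.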

Step 2.2 - Getting a Bitcoin rate on the Kraken Exchange

We implement it as follows:

 btc_usd_price_kraken = get_quandl_data('BCHARTS/KRAKENUSD') 

To check that the script works, we can look at the first five rows of the returned data using the head() method.

 btc_usd_price_kraken.head() 

Result:
Date        Open       High       Low        Close      Volume (BTC)  Volume (Currency)  Weighted Price
2014-01-07  874.67040  892.06753  810.00000  810.00000  15.622378     13151.472844       841.835522
2014-01-08  810.00000  899.84281  788.00000  824.98287  19.182756     16097.329584       839.156269
2014-01-09  825.56345  870.00000  807.42084  841.86934   8.158335      6784.249982       831.572913
2014-01-10  839.99000  857.34056  817.00000  857.33056   8.024510      6780.220188       844.938794
2014-01-11  858.20000  918.05471  857.16554  899.84105  18.748285     16698.566929       890.671709

Next, plot a chart to visualize the resulting data.

    btc_trace = go.Scatter(x=btc_usd_price_kraken.index, y=btc_usd_price_kraken['Weighted Price'])
    py.iplot([btc_trace])



Here we use Plotly to generate our visualizations. This is a less traditional choice than better-known libraries such as Matplotlib, but I think Plotly is a great choice because it creates fully interactive charts using D3.js.

Step 2.3 - Getting Bitcoin Rate on Multiple Exchanges

On an exchange, prices are set by supply and demand, so no single exchange holds the “true price” of Bitcoin. To work around this, we will pull data from three more major exchanges and calculate an aggregate price index.

We will load each exchange's data into a dictionary.

    # Pull pricing data for three more BTC exchanges
    exchanges = ['COINBASE', 'BITSTAMP', 'ITBIT']

    exchange_data = {}
    exchange_data['KRAKEN'] = btc_usd_price_kraken

    for exchange in exchanges:
        exchange_code = 'BCHARTS/{}USD'.format(exchange)
        btc_exchange_df = get_quandl_data(exchange_code)
        exchange_data[exchange] = btc_exchange_df

Step 2.4 - Combining all prices into a single data set

Define a simple function to merge data.

    def merge_dfs_on_column(dataframes, labels, col):
        '''Merge a single column of each dataframe into a new combined dataframe'''
        series_dict = {}
        for index in range(len(dataframes)):
            series_dict[labels[index]] = dataframes[index][col]

        return pd.DataFrame(series_dict)
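To see what this merge does when the inputs cover different dates, here is a small self-contained sketch. The two "exchanges" A and B and their prices are made up for illustration; the function body repeats the logic defined above.

```python
import pandas as pd


def merge_dfs_on_column(dataframes, labels, col):
    """Same logic as above: one column per label, aligned on the index."""
    series_dict = {}
    for index in range(len(dataframes)):
        series_dict[labels[index]] = dataframes[index][col]
    return pd.DataFrame(series_dict)


# Two made-up exchanges whose date ranges only partly overlap.
a = pd.DataFrame({"Weighted Price": [100.0, 101.0]},
                 index=pd.to_datetime(["2018-01-01", "2018-01-02"]))
b = pd.DataFrame({"Weighted Price": [102.0, 103.0]},
                 index=pd.to_datetime(["2018-01-02", "2018-01-03"]))

merged = merge_dfs_on_column([a, b], ["A", "B"], "Weighted Price")
print(merged)
```

The result has one row per date that appears in either input, with NaN where an exchange has no data for that date.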

Then combine all the data on the column "Weighted Price".

 btc_usd_datasets = merge_dfs_on_column(list(exchange_data.values()), list(exchange_data.keys()), 'Weighted Price') 

Now we’ll look at the last five rows using the tail() method to make sure everything looks as expected.

 btc_usd_datasets.tail() 

Result:
Date        BITSTAMP      COINBASE      ITBIT         KRAKEN        avg_btc_price_usd
2018-02-28  10624.382893  10643.053573  10621.099426  10615.587987  10626.030970
2018-03-01  10727.272600  10710.946064  10678.156872  10671.653953  10697.007372
2018-03-02  10980.298658  10982.181881  10973.434045  10977.067909  10978.245623
2018-03-03  11332.934468  11317.108262  11294.620763  11357.539095  11325.550647
2018-03-04  11260.751253  11250.771211  11285.690725  11244.836468  11260.512414

(The avg_btc_price_usd column in this output is the one added in Step 2.6.)

Step 2.5 - Comparing price data sets.

The next logical step is to visualize the comparison of prices received. To do this, we define a helper function that will build a graph for each of the exchanges using Plotly.

    def df_scatter(df, title, seperate_y_axis=False, y_axis_label='', scale='linear', initial_hide=False):
        '''Generate a scatter plot of the entire dataframe'''
        label_arr = list(df)
        series_arr = list(map(lambda col: df[col], label_arr))

        layout = go.Layout(
            title=title,
            legend=dict(orientation="h"),
            xaxis=dict(type='date'),
            yaxis=dict(
                title=y_axis_label,
                showticklabels=not seperate_y_axis,
                type=scale
            )
        )

        y_axis_config = dict(
            overlaying='y',
            showticklabels=False,
            type=scale
        )

        # Plotly accepts True, False or 'legendonly' for trace visibility.
        visibility = True
        if initial_hide:
            visibility = 'legendonly'

        # Form a trace for each series
        trace_arr = []
        for index, series in enumerate(series_arr):
            trace = go.Scatter(
                x=series.index,
                y=series,
                name=label_arr[index],
                visible=visibility
            )

            # Add a separate axis for the series
            if seperate_y_axis:
                trace['yaxis'] = 'y{}'.format(index + 1)
                layout['yaxis{}'.format(index + 1)] = y_axis_config
            trace_arr.append(trace)

        fig = go.Figure(data=trace_arr, layout=layout)
        py.iplot(fig)

And call it:

    df_scatter(btc_usd_datasets, 'Bitcoin Price (USD) By Exchange')

Result:



Now we will remove all zero values by replacing them with NaN, since we know the price was never zero in the period we are considering.

 btc_usd_datasets.replace(0, np.nan, inplace=True) 
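The cleanup matters beyond the chart, presumably because zeros would also distort the cross-exchange average calculated in Step 2.6. The toy example below, with made-up prices for two hypothetical exchanges A and B, shows the difference.

```python
import numpy as np
import pandas as pd

prices = pd.DataFrame({
    "A": [100.0, 0.0, 104.0],   # the 0.0 stands for a missing quote
    "B": [102.0, 102.0, 106.0],
})

mean_with_zeros = prices.mean(axis=1)

cleaned = prices.replace(0, np.nan)
mean_without_zeros = cleaned.mean(axis=1)  # NaN values are skipped by default

print(mean_with_zeros.tolist())     # [101.0, 51.0, 105.0]
print(mean_without_zeros.tolist())  # [101.0, 102.0, 105.0]
```

With the zero left in place, the second day's average is halved. After the replacement, that day's average is based only on the exchange that actually reported a price.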

And redraw the chart:

 df_scatter(btc_usd_datasets, 'Bitcoin Price (USD) By Exchange') 

Result:



Step 2.6 - Calculate Average Price

Now we can calculate a new column containing the average daily bitcoin price on all exchanges.

 btc_usd_datasets['avg_btc_price_usd'] = btc_usd_datasets.mean(axis=1) 

This new column is our Bitcoin price index. Let's plot it to make sure it looks reasonable.

    btc_trace = go.Scatter(x=btc_usd_datasets.index, y=btc_usd_datasets['avg_btc_price_usd'])
    py.iplot([btc_trace])

Result:



We will use this data later to convert other cryptocurrency exchange rates to USD.

Step 3 - Acquiring data on alternative cryptocurrencies



Now that we have a dataset of Bitcoin prices, let's get some data on alternative cryptocurrencies.

Step 3.1 - Define functions for working with the Poloniex API.

We will use the Poloniex API to get the data. We define two helper functions for loading and caching JSON data from this API.

First, we define the get_json_data function, which will load and cache JSON data from the provided URL.

    def get_json_data(json_url, cache_path):
        '''Download and cache JSON data, return as a dataframe.'''
        try:
            # Reuse the pickled copy if this URL was downloaded before.
            with open(cache_path, 'rb') as f:
                df = pickle.load(f)
            print('Loaded {} from cache'.format(json_url))
        except (OSError, IOError) as e:
            print('Downloading {}'.format(json_url))
            df = pd.read_json(json_url)
            df.to_pickle(cache_path)
            print('Cached {} at {}'.format(json_url, cache_path))
        return df

Then we define a function to format the HTTP requests for the Poloniex API and call our new function get_json_data to save the received data.

    base_polo_url = 'https://poloniex.com/public?command=returnChartData&currencyPair={}&start={}&end={}&period={}'
    start_date = datetime.strptime('2015-01-01', '%Y-%m-%d')  # get data from the start of 2015
    end_date = datetime.now()  # up until today
    period = 86400  # pull daily data (86,400 seconds per day)

    def get_crypto_data(poloniex_pair):
        '''Retrieve cryptocurrency data from Poloniex'''
        json_url = base_polo_url.format(poloniex_pair, start_date.timestamp(), end_date.timestamp(), period)
        data_df = get_json_data(json_url, poloniex_pair)
        data_df = data_df.set_index('date')
        return data_df

This function takes a cryptocurrency pair code (for example, “BTC_ETH”) and returns historical exchange rate data for the two currencies.
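To make the request format concrete, the sketch below builds the URL for one pair without sending it. The helper name build_chart_url is hypothetical, and the dates are fixed in UTC (an assumption made here so the output does not depend on the local time zone, unlike the naive datetimes above).

```python
from datetime import datetime, timezone

base_polo_url = ('https://poloniex.com/public?command=returnChartData'
                 '&currencyPair={}&start={}&end={}&period={}')


def build_chart_url(pair, start, end, period=86400):
    """Fill the URL template; start and end are sent as Unix timestamps."""
    return base_polo_url.format(pair, start.timestamp(), end.timestamp(), period)


# Fixed UTC dates so the result does not depend on the local time zone.
start = datetime(2015, 1, 1, tzinfo=timezone.utc)
end = datetime(2015, 1, 2, tzinfo=timezone.utc)
url = build_chart_url('BTC_ETH', start, end)
print(url)
```

The start and end values appear in the URL as seconds since 1970-01-01, and period=86400 asks for one data point per day.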

Step 3.2 - Downloading data from Poloniex

Some of the alternative cryptocurrencies in question cannot be bought directly for USD on exchanges. For this reason, we will download each coin's exchange rate against Bitcoin and then use our existing Bitcoin pricing data to convert that value to USD.

We download exchange data for nine popular cryptocurrencies: Ethereum, Litecoin, Ripple, Ethereum Classic, Stellar, Dash, Siacoin, Monero and NEM.

    altcoins = ['ETH', 'LTC', 'XRP', 'ETC', 'STR', 'DASH', 'SC', 'XMR', 'XEM']

    altcoin_data = {}
    for altcoin in altcoins:
        coinpair = 'BTC_{}'.format(altcoin)
        crypto_price_df = get_crypto_data(coinpair)
        altcoin_data[altcoin] = crypto_price_df

Now we have nine datasets, each containing the historical daily average exchange rate between Bitcoin and one alternative cryptocurrency.

We can look at the last few lines of the Ethereum pricing table to make sure it looks normal.

 altcoin_data['ETH'].tail() 

date        close     high      low       open      quoteVolume   volume       weightedAverage
2018-03-01  0.079735  0.082911  0.079232  0.082729  17981.733693  1454.206133  0.080871
2018-03-02  0.077572  0.079719  0.077014  0.079719  18482.985554  1448.732706  0.078382
2018-03-03  0.074500  0.077623  0.074356  0.077562  15058.825646  1139.640375  0.075679
2018-03-04  0.075111  0.077630  0.074389  0.074500  12258.662182   933.480951  0.076149
2018-03-05  0.075373  0.075700  0.074723  0.075277  10993.285936   826.576693  0.075189

Step 3.3 - Price Conversion to USD.

Since we now have a bitcoin exchange rate for each cryptocurrency and we have an index of historical Bitcoin prices in USD, we can directly calculate the price in USD for each alternative cryptocurrency.

    # Calculate the USD price as a new column in each altcoin dataframe
    for altcoin in altcoin_data.keys():
        altcoin_data[altcoin]['price_usd'] = altcoin_data[altcoin]['weightedAverage'] * btc_usd_datasets['avg_btc_price_usd']

This creates a new column with USD prices in each alternative cryptocurrency dataset.
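The multiplication works because pandas aligns both series on their date index before multiplying. The sketch below illustrates this with made-up numbers; the variable names and prices are hypothetical.

```python
import pandas as pd

dates = pd.to_datetime(["2018-03-01", "2018-03-02", "2018-03-03"])

# Hypothetical altcoin price quoted in BTC (last day missing on purpose).
alt_in_btc = pd.Series([0.08, 0.07], index=dates[:2])
# Hypothetical BTC price index in USD.
btc_in_usd = pd.Series([10000.0, 11000.0, 11500.0], index=dates)

# pandas matches rows by date, not by position, before multiplying.
alt_in_usd = alt_in_btc * btc_in_usd
print(alt_in_usd)
```

Dates present in only one of the two series come out as NaN, so a gap in either source shows up as a gap in the USD price rather than as a wrong value.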

Then we can reuse our function merge_dfs_on_column to create a combined price data set in USD for each of the cryptocurrencies.

 combined_df = merge_dfs_on_column(list(altcoin_data.values()), list(altcoin_data.keys()), 'price_usd') 

Now add the bitcoin price to the dataset as the final column.

 combined_df['BTC'] = btc_usd_datasets['avg_btc_price_usd'] 

As a result, we have a dataset containing daily USD prices for the ten cryptocurrencies we are considering.

We use our df_scatter function to display all cryptocurrency prices on a chart.

    df_scatter(combined_df, 'Cryptocurrency Prices (USD)', seperate_y_axis=False, y_axis_label='(USD)', scale='log')

This chart provides a fairly solid "big picture" of how the exchange rates of each currency have changed over the past few years.



In this example, we use a logarithmic scale on the Y axis so that all currencies can be compared on the same chart. You can try different parameter values (for example, scale='linear') to get different perspectives on the data.

Step 3.4 - Calculation of cryptocurrency correlation.

You may notice that cryptocurrency exchange rates, despite their very different values and volatility, seem slightly correlated. And as the surge in April 2017 shows, even small fluctuations seem to occur in sync across the entire market.

We can test our correlation hypothesis using the Pandas corr() method, which calculates the Pearson correlation coefficient between every pair of columns in a dataset. Before calling it, we also apply the pct_change() method, which converts each cell in the dataset from an absolute price to the percentage change from the previous day.
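A small made-up example shows what these two calls produce. The three price series A, B and C below are hypothetical: B makes the same percentage moves as A at a different price level, while C moves in the opposite direction.

```python
import pandas as pd

prices = pd.DataFrame({
    "A": [100.0, 110.0, 99.0, 108.9],   # +10%, -10%, +10%
    "B": [200.0, 220.0, 198.0, 217.8],  # same percentage moves as A
    "C": [50.0, 45.0, 49.5, 44.55],     # opposite moves: -10%, +10%, -10%
})

returns = prices.pct_change()          # first row is NaN: nothing to compare with
corr = returns.corr(method='pearson')  # NaN rows are ignored pairwise
print(corr.round(3))
```

A and B come out with a coefficient of 1 even though their prices differ by a factor of two, and A and C come out at -1. Working with percentage changes is what makes coins with very different price levels comparable.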

First, we calculate the correlations for 2016.

    combined_df_2016 = combined_df[combined_df.index.year == 2016]
    combined_df_2016.pct_change().corr(method='pearson')

Result:
           DASH        ETC        ETH        LTC         SC        STR        XEM        XMR        XRP        BTC
DASH   1.000000   0.003992   0.122695  -0.012194   0.026602   0.058083   0.014571   0.121537   0.088657  -0.014040
ETC    0.003992   1.000000  -0.181991  -0.131079  -0.008066  -0.102654  -0.080938  -0.105898  -0.054095  -0.170538
ETH    0.122695  -0.181991   1.000000  -0.064652   0.169642   0.035093   0.043205   0.087216   0.085630  -0.006502
LTC   -0.012194  -0.131079  -0.064652   1.000000   0.012253   0.113523   0.160667   0.129475   0.053712   0.750174
SC     0.026602  -0.008066   0.169642   0.012253   1.000000   0.143252   0.106153   0.047910   0.021098   0.035116
STR    0.058083  -0.102654   0.035093   0.113523   0.143252   1.000000   0.225132   0.027998   0.320116   0.079075
XEM    0.014571  -0.080938   0.043205   0.160667   0.106153   0.225132   1.000000   0.016438   0.101326   0.227674
XMR    0.121537  -0.105898   0.087216   0.129475   0.047910   0.027998   0.016438   1.000000   0.027649   0.127520
XRP    0.088657  -0.054095   0.085630   0.053712   0.021098   0.320116   0.101326   0.027649   1.000000   0.044161
BTC   -0.014040  -0.170538  -0.006502   0.750174   0.035116   0.079075   0.227674   0.127520   0.044161   1.000000

Coefficients close to 1 mean the two series are strongly correlated, coefficients close to -1 mean they are strongly inversely correlated, and coefficients close to zero mean the values tend to fluctuate independently of each other.
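For reference, the Pearson coefficient is the covariance of two series divided by the product of their standard deviations. The sketch below computes it by hand and cross-checks the result against pandas. The function name pearson and the sample daily changes are illustrative only.

```python
import numpy as np
import pandas as pd


def pearson(x, y):
    """Pearson coefficient: covariance divided by the product of standard deviations."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()
    dy = y - y.mean()
    return (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())


x = [0.01, -0.02, 0.03, 0.00]
up = [0.02, -0.04, 0.06, 0.00]     # moves with x
down = [-0.01, 0.02, -0.03, 0.00]  # moves against x

print(round(pearson(x, up), 6), round(pearson(x, down), 6))
# Cross-check against pandas on the same data.
print(round(pd.Series(x).corr(pd.Series(up)), 6))
```

The series that moves with x gives a coefficient of 1 and the one that moves against it gives -1, regardless of how large the moves are.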

To visualize these results, we will create another helper function.

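    def correlation_heatmap(df, title, absolute_bounds=True):
        '''Plot a correlation heatmap for the entire dataframe'''
        heatmap = go.Heatmap(
            # .values replaces as_matrix(), which newer pandas versions no longer provide
            z=df.corr(method='pearson').values,
            x=df.columns,
            y=df.columns,
            colorbar=dict(title='Pearson Coefficient'),
        )

        layout = go.Layout(title=title)

        # Pin the color scale to the full [-1, 1] range of the coefficient.
        if absolute_bounds:
            heatmap['zmax'] = 1.0
            heatmap['zmin'] = -1.0

        fig = go.Figure(data=[heatmap], layout=layout)
        py.iplot(fig)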

    correlation_heatmap(combined_df_2016.pct_change(), "Cryptocurrency Correlations (2016)")



Here, the dark red values represent strong correlations, and the blue values represent strong inverse correlations. All other colors represent varying degrees of weak or non-existent correlation.

What does this chart tell us? In essence, it shows that there was very little statistically significant relationship between the price fluctuations of different cryptocurrencies during 2016.

Now, to test our hypothesis that cryptocurrencies have become more correlated in recent months, we repeat the same tests using data for 2017 and 2018.

    combined_df_2017 = combined_df[combined_df.index.year == 2017]
    combined_df_2017.pct_change().corr(method='pearson')

Result:
           DASH        ETC        ETH        LTC         SC        STR        XEM        XMR        XRP        BTC
DASH   1.000000   0.387555   0.506911   0.340153   0.291424   0.183038   0.325968   0.498418   0.091146   0.307095
ETC    0.387555   1.000000   0.601437   0.482062   0.298406   0.210387   0.321852   0.447398   0.114780   0.416562
ETH    0.506911   0.601437   1.000000   0.437609   0.373078   0.259399   0.399200   0.554632   0.212350   0.410771
LTC    0.340153   0.482062   0.437609   1.000000   0.339144   0.307589   0.379088   0.437204   0.323905   0.420645
SC     0.291424   0.298406   0.373078   0.339144   1.000000   0.402966   0.331350   0.378644   0.243872   0.325318
STR    0.183038   0.210387   0.259399   0.307589   0.402966   1.000000   0.339502   0.327488   0.509828   0.230957
XEM    0.325968   0.321852   0.399200   0.379088   0.331350   0.339502   1.000000   0.336076   0.268168   0.329431
XMR    0.498418   0.447398   0.554632   0.437204   0.378644   0.327488   0.336076   1.000000   0.226636   0.409183
XRP    0.091146   0.114780   0.212350   0.323905   0.243872   0.509828   0.268168   0.226636   1.000000   0.131469
BTC    0.307095   0.416562   0.410771   0.420645   0.325318   0.230957   0.329431   0.409183   0.131469   1.000000

    correlation_heatmap(combined_df_2017.pct_change(), "Cryptocurrency Correlations (2017)")



    combined_df_2018 = combined_df[combined_df.index.year == 2018]
    combined_df_2018.pct_change().corr(method='pearson')

Result:
           DASH        ETC        ETH        LTC         SC        STR        XEM        XMR        XRP        BTC
DASH   1.000000   0.775561   0.856549   0.847947   0.733168   0.717240   0.769135   0.913044   0.779651   0.901523
ETC    0.775561   1.000000   0.808820   0.667434   0.530840   0.551207   0.641747   0.696060   0.637674   0.694228
ETH    0.856549   0.808820   1.000000   0.700708   0.624853   0.630380   0.752303   0.816879   0.652138   0.787141
LTC    0.847947   0.667434   0.700708   1.000000   0.683706   0.596614   0.593616   0.765904   0.644155   0.831780
SC     0.733168   0.530840   0.624853   0.683706   1.000000   0.615265   0.695136   0.626091   0.719462   0.723976
STR    0.717240   0.551207   0.630380   0.596614   0.615265   1.000000   0.790420   0.642810   0.854057   0.669746
XEM    0.769135   0.641747   0.752303   0.593616   0.695136   0.790420   1.000000   0.744325   0.829737   0.734044
XMR    0.913044   0.696060   0.816879   0.765904   0.626091   0.642810   0.744325   1.000000   0.668016   0.888284
XRP    0.779651   0.637674   0.652138   0.644155   0.719462   0.854057   0.829737   0.668016   1.000000   0.712146
BTC    0.901523   0.694228   0.787141   0.831780   0.723976   0.669746   0.734044   0.888284   0.712146   1.000000

    correlation_heatmap(combined_df_2018.pct_change(), "Cryptocurrency Correlations (2018)")



And here we see what we expected: almost all of the cryptocurrencies have become more strongly correlated with one another across the board.

This concludes our introduction to working with data in Anaconda.

Source: https://habr.com/ru/post/350500/

