This is another article in our "Solar Energy Data Analysis" series that analyzes data available from Open PV project website. The full dataset is huge so we decided to play with a subset of data coming from California. If you're not yet familiar with the Open PV project it's a collaborative initiative between government, industry and public sector that compiles PV installation data and makes it publicly available. The data used in this analysis is available by this link. It contains data about solar PV system installs in California from July 1998 to January 2016. The csv dataset contains many columns but for the sake of keeping the task more manageable we've only used cost_per_watt which is the total cost of the installation divided by the number of Watts the system is capable of producing. The analysis is done in Jupyter notebook using ipython so python code used to make the graph with plot.ly is also included.
In [267]:
import pandas as pd
data = pd.read_csv('/Users/staceyfinch/Downloads/openpv-export-201812042020.csv', parse_dates=[1])

In [242]:
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
init_notebook_mode(connected=True)

Now let's check the dimensions of the dataset, that is how many rows and column does it have?
In [268]:
data.shape

Out[268]:
(53835, 82)
For the purpose of this analysis we only use rows where the "cost_per_watt" column is not null so we only keep those rows.
In [269]:
df = data[pd.notnull(data['cost_per_watt'])]

In [272]:
df.shape

Out[272]:
(40964, 82)
Now because the dataset is fairly large we'll use some statistical plotting to make sense of it. Don't worry if you're not that into statistics, we'll keep it simple and add some descriptions to make the plots easy to grasp.
In [314]:
import plotly.plotly as py
import plotly.graph_objs as go
trace = go.Box(
y = df['cost_per_watt'],
hoverlabel = dict(
font = dict(
size = 18
)
),
marker = dict(
outliercolor = 'rgba(219, 64, 82, 0.6)',
line = dict(
outliercolor = 'rgba(219, 64, 82, 0.6)',
outlierwidth = 2
)
)
)
layout = dict(
title="Solar PV Installation Cost Per Watt Box Plot",
yaxis = dict(
tickprefix = "$", hoverformat = ".2f" ), autosize = True ) data = [trace] fig = go.Figure(data=data, layout=layout) iplot(fig)  The box plot shows the distribution of cost-per-watt (CPW) values. We can immediately see some interesting details about the data. First, for the majority of data CPW values are in the range of \$0.88 - \$10.78 That's a dramatic decrease in costs between 1998 and 2016. We also have some outliers dropping the cost to as low as \$0.76/watt and driving it up as high as \$29.77/watt. We can also plot each data point on a time series plot. The X axis would be the date of installation and the Y axis would be the cost-per-watt in USD. In [294]: trace_scatter = go.Scattergl( x = df['date_installed'], y = df['cost_per_watt'], mode = 'markers', hoverlabel = dict( bgcolor = "#ffffff" ), marker = dict( color = '#6497b1', line = dict(width = 1) ) ) layout = dict( title = "Cost/Watt distribution From 1998 to 2016 in California", xaxis = dict( range = ['1999-01-01', '2016-01-01'], rangeselector=dict( buttons=list([ dict(count=1, label='1m', step='month', stepmode='backward'), dict(count=6, label='6m', step='month', stepmode='backward'), dict(step='all') ]) ), rangeslider=dict( visible = True ) ), yaxis = dict( ticksuffix = "$",
hoverformat = ".2f"
)
)
fig = dict(data=[trace_scatter], layout=layout)
iplot(fig)