This is another article in our "Solar Energy Data Analysis" series that analyzes data available from Open PV project website. The full dataset is huge so we decided to play with a subset of data coming from California. If you're not yet familiar with the Open PV project it's a collaborative initiative between government, industry and public sector that compiles PV installation data and makes it publicly available. The data used in this analysis is available by this link. It contains data about solar PV system installs in California from July 1998 to January 2016. The csv dataset contains many columns but for the sake of keeping the task more manageable we've only used cost_per_watt which is the total cost of the installation divided by the number of Watts the system is capable of producing. The analysis is done in Jupyter notebook using ipython so python code used to make the graph with plot.ly is also included.

import pandas as pd
data = pd.read_csv('/Users/staceyfinch/Downloads/openpv-export-201812042020.csv', parse_dates=[1])

from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
init_notebook_mode(connected=True)

Now let's check the dimensions of the dataset, that is how many rows and column does it have?

data.shape

(53835, 82)

For the purpose of this analysis we only use rows where the "cost_per_watt" column is not null so we only keep those rows.

df = data[pd.notnull(data['cost_per_watt'])]

df.shape

(40964, 82)

Now because the dataset is fairly large we'll use some statistical plotting to make sense of it. Don't worry if you're not that into statistics, we'll keep it simple and add some descriptions to make the plots easy to grasp.

import plotly.plotly as py
import plotly.graph_objs as go
trace = go.Box(
    y = df['cost_per_watt'],
    hoverlabel = dict(
        font = dict(
            size = 18
        )
    ),
    marker = dict(
        outliercolor = 'rgba(219, 64, 82, 0.6)',
        line = dict(
            outliercolor = 'rgba(219, 64, 82, 0.6)',
            outlierwidth = 2
        )
    )
)
layout = dict(
    title="Solar PV Installation Cost Per Watt Box Plot",
    yaxis = dict(
        tickprefix = "$",
        hoverformat = ".2f"
    ),
    autosize = True
)
data = [trace]
fig = go.Figure(data=data, layout=layout)
iplot(fig)

The box plot shows the distribution of cost-per-watt (CPW) values. We can immediately see some interesting details about the data. First, for the majority of data CPW values are in the range of \$0.88 - \$10.78 That's a dramatic decrease in costs between 1998 and 2016. We also have some outliers dropping the cost to as low as \$0.76/watt and driving it up as high as \$29.77/watt. We can also plot each data point on a time series plot. The X axis would be the date of installation and the Y axis would be the cost-per-watt in USD.

trace_scatter = go.Scattergl(
    x = df['date_installed'],
    y = df['cost_per_watt'],
    mode = 'markers',
    hoverlabel = dict(
        bgcolor = "#ffffff"
    ),
    marker = dict(
        color = '#6497b1',
        line = dict(width = 1)
    )
)
layout = dict(
    title = "Cost/Watt distribution From 1998 to 2016 in California",
    xaxis = dict(
        range = ['1999-01-01', '2016-01-01'],
        rangeselector=dict(
            buttons=list([
                dict(count=1,
                     label='1m',
                     step='month',
                     stepmode='backward'),
                dict(count=6,
                     label='6m',
                     step='month',
                     stepmode='backward'),
                dict(step='all')
            ])
        ),
        rangeslider=dict(
            visible = True
        )
        
    ),
    yaxis = dict(
        ticksuffix = "$",
        hoverformat = ".2f"
    )
)
fig = dict(data=[trace_scatter], layout=layout)
iplot(fig)

Many points overlap because the data is very dense but you can see the general trend of decreasing costs. To zoom in on a particular period of time use the sliders below the X asis. If you zoom in down to a month view you'll see vertical clusters of points for a particular date. We can condense the data by calculating the mean of the costs for a particular date and plotting it as a single point for that date. Here's how we do it

mn = df.groupby('date_installed')['cost_per_watt'].mean()

mean = go.Scattergl(
    name = 'Mean',
    x = mn.index,
    y = mn,
    mode = 'markers',
    hoverlabel = dict(
        bgcolor = "#ffffff",
        font = dict(
            size = 18
        )
    ),
    marker = dict(
        color='#92B0C4', 
        line=dict(width=1)
    )
)

dat = [mean]
layout = dict(
    xaxis = dict(
        title = "Use sliders to zoom in/out",
        range = ['2006-01-01', '2016-01-01'],
        autorange = True,
        rangeselector=dict(
            buttons=list([
                dict(count=1,
                     label='1m',
                     step='month',
                     stepmode='backward'),
                dict(count=6,
                     label='6m',
                     step='month',
                     stepmode='backward'),
                dict(step='all')
            ])
        ),
        rangeslider=dict(
            visible = True,
            autorange=True
        )
    ),
    yaxis = dict(
        title = "Cost-per-Watt",
        tickprefix = "$",
        tickfont=dict(
            family='Old Standard TT, serif',
            size = 15
        ),
        hoverformat = ".2f"
    ),
    title = "Aggregate Cost-per-Watt Values By Date.<br>Hover for breakdown."
)
fig = go.Figure(
    data = dat, layout = layout
)
iplot(fig)

To declutter the plot we've grouped the CPW values by date and calculated the mean value of each group. You see on the graph that more solar PV installation reportings became available since August 2001. CPW downward movement over time is more pronounced on this graph. Another type of a statistical plot is histogram. Plotting data as a histogram can show us the CPW range that's featured in the highest percentage of all installations.

trace_histo = go.Histogram(
    x = df['cost_per_watt'],
    histnorm = 'percent',
    hoverlabel = dict(
        font = dict(
            size = 18
        )
    ),
    xbins = dict(
        start = 1,
        end = 11,
        size = 0.4
    )
)
layout = dict(
    title = "Solar PV Installation CPW Distribution Histogram in California",
    yaxis = dict(
        ticksuffix = "%",
        tickfont = dict(
            size = 15
        ),
        hoverformat = ".2f"
    ),
    xaxis = dict(
        tickprefix = "$",
        tickfont = dict(
            size = 15
        )
    )
)
fig = go.Figure(data=[trace_histo], layout=layout)
iplot(fig)

So 10.33% of all installations fall into the cost-per-watt range of USD \$5 - \$5.4

trace_date_histo = go.Histogram(
    x = df['date_installed'],
    histnorm = 'percent',
    xbins = dict(
        
        size = 'M12'
    ),
    autobinx = False
)
layout = dict(
    title = "Solar PV Installation Date Distribution Histogram in California",
    yaxis = dict(
        ticksuffix = "%",
        tickfont = dict(
            size = 15
        ),
        hoverformat = ".2f"
    ),
    xaxis = dict(
        tickfont = dict(
            size = 15
        )
    )
)
fig = go.Figure(data=[trace_date_histo], layout=layout)
iplot(fig)

Above histogram shows that, naturally enough, the number of yearly installatios of solar PV systems rose year after year since 1999 as a result of declining prices. A curious fact is that in 2014 the number of installations fall almost to the level 2008 but then rose to a new height in 2015. If we had data for the recent 3 years it would likely indicate even higher growth.