Probability Denisty Functions

Check out my main page

This one is going to serve as a guide to making simple probability density function (PDFs) in R. I briefly covered this process in a past guide, but this is probably my favorite type of figure to construct so I wanted to make a dedicated page. This page is primarily a guide to wrangling the data into the format you need and then making PDFs in a number of ways. I won’t go into too much detail about the actual statistics involved.

Plotting a PDF is powerful because it tells us the distribution of a given variable and allow us to determine the probability a certain variable will have a certain value. We can also use PDFs to calculate standard deviations of a given variable in a data set.

The PDF is much like a histogram, but differs in the fact that a PDF is only useful for variables that exist in a continuous sequence. In other words, you can plot daily mean temperature with a PDF, but you cannot use a PDF to plot the number of days in which a certain temperature is met. This is because by nature of the PDF, the spaces between the data points are being filled in. If you are trying to plot the number of days in which a given threshold is surpassed a year, you will have whole numbers (3 days/yr, 7 days/year, 5 days/year etc). The PDF assumes that you can have 3.5 days/year or 7.23448 days/year. Thus, you have to question whether or not the variable you want to plot makes sense in a PDF.

There are plenty of resources for more information about PDFs. Jim Frost has an excellent statistics blog, which goes into more detail about PDFs.

Data Formatting

You can make a PDF with a wide array of data sets and variables. In this case I’m going to stick with my meteorologic data. Ill be using a 72 year long data set from Albany International Airport, which is hosted on my GitHub page. I discussed how to download this type of meteorologic data in a past guide.

library(ggplot2)
library(ggridges)
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(patchwork)

Alb_df<- read.csv("https://raw.githubusercontent.com/Plummquat/IanPlummer/gh-pages/albany.csv")

This data set contains daily meteorologic data from 1950 to 2022. We can actually plot this data as a PDF as is, however, well see later that it will be beneficial to create some summary data frames. Mainly we want to get monthly mean temperatures and precipitation totals instead of raw daily data.

Alb_Monthly<- Alb_df %>% 
  group_by(MONTH, YEAR) %>% 
summarize(MeanTemp = mean(TEMP, na.exclude=T), MeanMax = mean(MAX, na.exclude=T), MeanMin = mean(MIN, na.exclude=T), TotalP = sum(PRCP, na.exclude=T), MeanWSP = mean(WDSP, na.exclude=T), MeanHWSP = mean(MXSPD, na.exclude=T))

## `summarise()` has grouped output by 'MONTH'. You can override using the
## `.groups` argument.

#above code organizes the daily data by month and year and then calculates the mean and sum ov the varius variables we input. You need to use the na.exclude() or na.rm() line to make sure the code ignores NA inputs. While missing data is important to note, it can sometimes be ignored. For example, a few days of missing wind data is annoying but is unlikely to significantly skew a 72 year long data set.

We are also going to want to create some categories to explore the changes in these variables through time. Generally, I like to separate meteorologic data of this sort into data collected before and after 1980. This is because 1980 is generally excepted as the period of time in which global temperatures began to rapidly rise as a result of anthropogenic forcing. This didn’t occur until the early 1980s because of global dimming, which was brought on by anthropogenic aerosol emissions.

Alb_df$Period<- case_when(Alb_df$YEAR >= 1950 & Alb_df$YEAR < 1980 ~ "1950-1980", 
                          Alb_df$YEAR >= 1980 ~ "1980-Present")

Alb_Monthly$Period<- case_when(Alb_Monthly$YEAR >= 1950 & Alb_Monthly$YEAR < 1980 ~ "1950-1980", 
                          Alb_Monthly$YEAR >= 1980 ~ "1980-Present")

#Above lines of code create a new column called "Period" in the two data frames we made previously. It will assign a value of either "1950-1980" or "1980-Present" depending on the year associated with each data point.

Plotting the PDFs

Raw daily data

We’ll start off by plotting the PDFs of the raw daily data to see what it looks like.

Alb_df %>% 
  ggplot(aes(x = TEMP)) + #assign an X value.
  geom_density(fill = "gray") + #Assign a fill color to the data, this is not mandatory as R will input a value                                     automatically, however, it tends to pick some ugly colors by default. 
  theme_classic() + #Sets the theme, there are many you can choose from so feel free to play around with it.
  xlab("Temperature (°C)") #Set the name of the X axis.

This data seems to display a strong bi-modal distribution, meaning that we essentially have 2 peaks in probability instead of a simple bell curve. Temperature in temperate regions does actually often display this trend. It is mainly caused by the fact that Albany has some harsh and cold winters and fairly warm summers. The transitional seasons have less of an impact on this PDF than the cold and hot seasons do.

We can do the same analysis for the maximum and minimum temperatures and plot them along with the mean temperatures to compare them.

MeanT_fig<- Alb_df %>% 
  ggplot(aes(x = TEMP)) + 
  geom_density(fill = "orange") + 
  theme_classic() + 
  xlim(-30,40)+ #Sets the limits of the X axis,I did this to look at Tmean, Tmax, and Tmin on the same scale.
  xlab("Mean Temperature (°C)")

MaxT_fig<- Alb_df %>% 
  ggplot(aes(x = MAX)) + 
  geom_density(fill = "red") + 
  theme_classic() + 
  xlim(-30,40)+ 
  xlab("Max Temperature (°C)")

MinT_fig<- Alb_df %>% 
  ggplot(aes(x = MIN)) + 
  geom_density(fill = "blue") + 
  theme_classic() + 
  xlim(-30,40)+ 
  xlab("Min Temperature (°C)")

Combo_fig<- MaxT_fig/ MeanT_fig /  MinT_fig #This line uses patchwork to stitch the 3 figures we made before

Now we can clearly see some differences when compared to the PDF of the daily mean temperatures. While Max and Min temperatures also have a bi-modal distribution, there is some variation.

We could spend all day discussing the temperature PDFs but we have many more to make. Let’s try a more complex variable like daily precipitation totals.

Alb_df %>% 
  ggplot(aes(x = PRCP)) + 
  geom_density(fill = "gray") + 
  theme_classic() + 
  scale_x_log10() + #Sets the Xaxis to a log scale
  xlab("Precipitation (mm)")

## Warning: Transformation introduced infinite values in continuous x-axis

## Warning: Removed 14210 rows containing non-finite values (`stat_density()`).

This PDF is clearly skewed to the left. It was so skewed in fact, that we had to use a log scale on the X axis to actually see the data. This is simply due to the nature of precipitation. In a region like the northeast, a significant number of days of the year will have no rainfall. Thus, the PDF is going to be showing that 0 mm is the most common value for this variable. Despite how askew this PDF is, it does have some important uses that will be touched on later.

Monthly PDFs

Now that we have seen how raw daily data plots, we should also look at how monthly data plots. We know that temperature varies per season and the previous temperature PDFs showed a clear bi-modal signal that is likely seasonal. We can dissect this signal by separating the data by the month of the year.

Before we make the plot, there is one optional piece of data formatting we should do. The data is stored in a way that the months are a series of numbers (1-12). This makes graphing easy but the end result will not look all that professional. We can rectify this by creating a new column where the month number is equal to the month name.

Alb_Monthly$MonthName<- case_when(Alb_Monthly$MONTH == 1 ~ "January",
                                  Alb_Monthly$MONTH == 2 ~ "February",
                                  Alb_Monthly$MONTH == 3 ~ "March",
                                  Alb_Monthly$MONTH == 4 ~ "April",
                                  Alb_Monthly$MONTH == 5 ~ "May",
                                  Alb_Monthly$MONTH == 6 ~ "June",
                                  Alb_Monthly$MONTH == 7 ~ "July",
                                  Alb_Monthly$MONTH == 8 ~ "August",
                                  Alb_Monthly$MONTH == 9 ~ "September",
                                  Alb_Monthly$MONTH == 10 ~ "October",
                                  Alb_Monthly$MONTH == 11 ~ "November",
                                  Alb_Monthly$MONTH == 12 ~ "December",)
#Above code makes a new column in the Alb_Monthly data frame where the month name is equal to its associated numeric value 
Alb_Monthly$MonthName<- factor(Alb_Monthly$MonthName, levels = c("December", "November", "October", "September", "August", "July", "June", "May", "April", "March", "February", "January"))
#R does not know how to organize months so it defaults to alphabetic order. We have to use factor() to tell R the order it should plot these month values in. Note that I did this in reverse chronological order, this is because the specific graph were using works in reverse order. Most graphs use normal chronological order so you would need to reverse the order as above if you wanted to make other plots.

Now we can make the PDF that is separated by month.

Alb_Monthly %>% 
  ggplot(aes(x = MeanTemp, y = MonthName)) + #Same as before but were adding a Y argument. 
  geom_density_ridges() + #this command plots a density ridge instead of just a density plot. 
  ylab("Month")+ 
  theme_classic() +  
  xlab("Temperature (°C)")

## Picking joint bandwidth of 0.636

Look at that! We can clearly see that different months have different temperature distributions. Late spring and early summer months have those strong “bell curve” normal distributions. The late summer has slightly left skewed distributions. Finally, the fall, winter, and early spring have so much more temperature variability that the PDFs aren’t smooth.

Now lets revisit the precipitation PDFs. Using the monthly sums we will theoretically no longer see as strong a skew to the left.

Alb_Monthly %>% 
  ggplot(aes(x = TotalP, y = MonthName)) + 
  geom_density_ridges(aes(fill= MonthName)) + #To add some color we can use the fill command to tell R to fill the                                                PDFs a different color for each month
  ylab("Month")+ 
  theme_classic() +  
  xlab("Precipitation (mm)")

## Picking joint bandwidth of 14.7

## Warning: Removed 72 rows containing non-finite values (`stat_density_ridges()`).

From this we can tell that there is not a strong seasonal cycle to precipitation here in the northeast. It does, however, look like we see a greater probability of extreme precipitation events in the summer and fall. Given the location of Albany this is likely due to Atlantic derived moisture. If we had data in a region impacted by a seasonal monsoon we would see a figure that looks a lot like the temperature PDF, with rainfall increasing in the warm months.

Now I want to look at one more seasonal variable that is often left forgotten, wind.

Alb_Monthly %>% 
  ggplot(aes(x = MeanWSP, y = MonthName)) + 
  geom_density_ridges() +
  ylab("Month")+ 
  theme_classic() +  
  xlab("Wind Speed (m/sec)")

## Picking joint bandwidth of 0.198

The PDFs allow us to see that there is clearly a seasonal change in the distribution of mean wind speeds, with higher winds in the winter and lower in the summer.

Adding a time component

Now that we have explored how to make the PDFs and separate them by month, the last thing I want to show you is how to add a component of time to this analysis. We know that with anthropogenic climate change, many of these variable have already changed and will continue to change. As mentioned previously, global temperatures begin to rapidly rise in the early 1980s, meaning we can use this period of time as a sort of buffer point. The pre-1980s can be seen as the period of minimal anthropogenic signal, while the post 1980s can be seen as the one more heavily influenced by anthropogenic forcing.

Let’s look at daily temperature data before and after 1980.

Alb_df %>% 
  ggplot(aes(x = TEMP)) + 
  geom_density(aes(fill = Period), alpha = 0.7) + #We tell R to fill based on the Period column we made earlier
  scale_fill_manual(values = c("gray", "coral")) + #Just some nice colors 
  theme_classic() + 
  xlab("Temperature (°C)")

Here we can see the change in the distribution of daily temperatures as a result of anthropogenic warming. It is clear that the distribution has skewed a bit to the right in the last 40 years. In essence, the probability of daily temperatures being warmer has increased, while the probability of them being cold has decreased.

Lets decrease the seasonal noise by plotting this for the monthly data.

Alb_Monthly %>% 
  ggplot(aes(x = MeanTemp, y = MonthName)) + 
  geom_density_ridges(aes(fill = Period), alpha = 0.7) +
  scale_fill_manual(values = c("gray", "coral"))+ 
  ylab("Month")+ 
  theme_classic() +  
  xlab("Temperature (°C)")

## Picking joint bandwidth of 0.717

Now that is cool! Most months have seen their respective PDFs shift to the right. December has shown the most substantial shift to warmer mean temperatures. We can also see a decrease in variability in some months like April and May. The only months where the PDFs skew to the left are September and October, although these cooler mean temperatures appear to be balanced out by extending of the high temperature limb of the PDFs. The results from these PDFs are quite similar to the scatter plots that I have made a [guide]((https://plummquat.github.io/IanPlummer/Temp_Plots.html) for in the past. The advantage of those scatter plots is that they allow for actual statistical calculation of the significance of these trends.

You can also plot different variables like precipitation and wind to see if there have been changes in any of them through time.

Alb_Monthly %>% 
  ggplot(aes(x = MeanWSP, y = MonthName)) + 
  geom_density_ridges(aes(fill = Period), alpha = 0.7) +
  scale_fill_manual(values = c("gray", "coral"))+ 
  ylab("Month")+ 
  theme_classic() +  
  xlab("Temperature (°C)")

## Picking joint bandwidth of 0.194

All of this is quite interesting and I could spend endless hours making interpretations and discussing. However, I think that about wraps up this guide. Hopefully you can use this code to help you in your own work and analysis.