Disclaimer: This page is catered to folks who have some understanding of R and how it works. That said, all of this is fairly easy to replicate: you only need to download R and RStudio and copy and paste these chunks of code into the console to do it yourself. I apologize for not making it easier to customize.

Hello all! My initial inspiration for this particular data-vis stems from some really good work done by Brendan Wiltse of Paul Smith’s College. I’ll link his work below.

I hear talk about how nice the weather has been lately in the #adirondacks. Here is a plot of the ADK3 climate data (pub linked below). Curves are GAMs, red is periods of sig ⬆️ in temp. Sept & Nov temps on the rise, Oct trending up as well. This is our climate future. 1/2 pic.twitter.com/pR4iS3vQdM

— Brendan Wiltse (@BrendanWiltse) November 13, 2022

I thought this particular figure was a really neat way to show a time series, especially at monthly timescales. I wanted to find a way to replicate it on my own, mainly to challenge my coding skills. The following will allow you to follow along if you want to create this figure for an area near you. This can theoretically be done with any temperature data set from anywhere in the world. I decided to use a repository that I am familiar with, but this code should be applicable to many other data sets with minor adjustments.

Required Packages

For those not as familiar with R, the best way to download packages is to go to the “Packages” tab on the right-hand side of the RStudio window and search for the names of these packages. For everyone else, it is likely that you have most of these packages downloaded already.
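Alternatively, a single line from the console covers everything used below (including leaflet, which shows up later in the mapping step):

install.packages(c("tidyverse", "lubridate", "patchwork", "GSODR", "leaflet"))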

library(tidyverse) # Loads dplyr, tidyr, and ggplot2, among others
library(lubridate) # Date handling
library(dplyr)     # Already part of the tidyverse, loaded explicitly here
library(patchwork) # For combining plots and building insets
library(ggplot2)   # Also part of the tidyverse
library(GSODR) #This is the neat package to extract the data sets

Extracting Data With GSODR

I’m going to be taking data from the Global Surface Summary of the Day (GSOD), which is curated by the National Centers for Environmental Information (NCEI). NCEI is a well-trusted source of weather data and has stations all over the world. That said, there are some quirks with this system that we will have to deal with in order to work with the data.

As mentioned, GSODR allows you to request data from stations all over the world, some of which have been recording for over 70 years. There are a number of ways to find station metadata, most of which are highlighted here, but I will also briefly go over some of the more useful options.

If you simply want data from stations near you or a location of interest, you can use the ‘nearest_stations’ command, which lets you plug in the latitude and longitude of your location of interest plus a search radius around it. A minimal sketch (the coordinates here are approximate ones for LaGuardia; note also that the exact return format differs between GSODR versions):
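nearest_stations(LAT = 40.78, LON = -73.87, distance = 50) # Stations within 50 km of LaGuardia

While this is a fairly easy command to follow, I find it better to actually explore the station locations graphically.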

load(system.file("extdata", "isd_history.rda", package = "GSODR"))
library(leaflet)

map_gsod <- isd_history %>% 
  leaflet() %>% 
  addProviderTiles(providers$OpenStreetMap) %>% 
  addMarkers(~ LON, ~ LAT, popup = ~ STNID, clusterOptions = markerClusterOptions()) # Cluster nearby markers so the map stays readable

map_gsod

This map allows you to zoom in on a given area and see all the stations in the GSODR system. I want to work with data from NYC and am particularly interested in Queens and Long Island. I have plenty of options, so a flip of a coin leads me to LaGuardia Airport. Now you will notice that many of these areas have two stations plotted in the same spot. I am not sure of the exact reason GSODR does this, but longer records are split into two smaller data sets. This adds a small complication that is easy to deal with. We’re going to use the ‘get_GSOD’ command to extract data. You will have to play with the command a bit to figure out which station holds the older/younger part of the record. I recommend trying to extract the whole record from one station; it will likely throw an error that says something along the lines of “this station only has data from XXXX to YYYY”. That range will tell you which dates can be extracted from each part of the data set.

met_df <- get_GSOD(years = 1946:2022, station = "725180-14735") # Command to get data from the younger portion of the LaGuardia record
## 
## This station, 725180-14735, only provides data for years 1946 to 2021.
met_df2 <- get_GSOD(years = 1945:1972, station = "999999-14735") # Command to get data from the older portion of the LaGuardia record

Great, now we have the two data sets from the LaGuardia station, and all we need to do is combine them into one full record.
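Here is that step as a minimal sketch; I am assuming the combined object is the ‘queens_df’ used below (‘bind_rows()’ stacks the two data frames, and ‘distinct()’ guards against any days that appear in both records):

queens_df <- bind_rows(met_df2, met_df) %>% # Stack the older and younger records
  distinct(YEARMODA, .keep_all = TRUE)      # Drop any duplicated days, keeping all columns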

Data Exploration

Now that we have the data, we may want to take an exploratory look to make sure everything is in order. For this I’m just going to plot the maximum and minimum temperatures for the whole record.

queens_df %>%
  ggplot() +
  geom_line(aes(x = YEARMODA, y = MIN), color = "blue") +
  geom_line(aes(x = YEARMODA, y = MAX), color = "red") +
  labs(title = met_df$NAME[1],
       x = "Date",
       y = "Temperature (C)",
       caption = paste("Data source: GSOD station", met_df$STNID[1])
       ) +
  theme_classic()

That looks great! As a brief aside, I have recently been seeing this kind of graph used by climate change deniers as “evidence” that temperatures aren’t increasing. In fact, I have seen the argument put as “if you look at the temperature ranges there is no trend! Scientists use these ‘anomalies’ to trick you into thinking there are dramatic warming trends when the raw data doesn’t support this”. This is a frustratingly bad argument, but I want to take a second to address it. The reason that monthly and annual averages, as well as anomalies, are used is that they convey the more relevant information. Any given year will have high and low temperature extremes, and while there is evidence of those extremes shifting, there is more nuance and variability in those values. You will still have cold days in the winter, but that does not mean regional temperatures aren’t increasing.

To put this in other words, take a data set of 4 numbers (2, 4, 4, 10); the mean of this data set is 5. Now take another 4-number data set (2, 8, 8, 10); despite having the same range, its mean is 7. We can even expand the low end of the range (0, 8, 8, 10) and still get a mean (6.5) above that of the original data set. This is exactly why we take averages of data. It is not some “great lie”; it is basic statistics. Furthermore, annual temperatures can vary by over 40 degrees Celsius, so plotting an elongated series of raw temperatures will graphically squash most trends, whether upward or downward.
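Spelled out in R, since we have it open anyway:

mean(c(2, 4, 4, 10)) # 5
mean(c(2, 8, 8, 10)) # 7
mean(c(0, 8, 8, 10)) # 6.5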

Soapboxing about climate and data literacy over, we can now get to work creating those aforementioned mean values. It is also potentially useful to look at precipitation, so we can add those values to this new data set as well.

nydf <- queens_df %>%
  group_by(YEAR, MONTH) %>%
  summarize(meanT = mean(TEMP), P = sum(PRCP)) # Monthly mean temperature and total precipitation
## `summarise()` has grouped output by 'YEAR'. You can override using the `.groups` argument.
nydf

When looking at the data you may notice that there are a lot of ‘NA’ values in the precipitation column. This is because the station wasn’t recording accurate enough precipitation data until the 70s. This is a shame, but there isn’t much we can do about it. Depending on the station you use, you will find similar issues. NCEI will exclude data that is suspect or clearly inaccurate, which may also explain some gaps. Interestingly, a complete record of precipitation at LaGuardia can be found on the NWS website, so you could also go through the effort of downloading data from them. The only reason I am not is that it is more difficult to automate that process, and this was supposed to serve as a simple guide to creating these types of plots.
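If you want to see exactly where those gaps are, a quick count of the missing monthly sums does the trick (this uses the ‘nydf’ data frame we just built; a month comes out ‘NA’ whenever any of its daily values were missing):

nydf %>%
  filter(is.na(P)) %>% # Months with at least one missing daily value
  count(YEAR)          # How many such months in each year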

Next it is useful to compile the whole-record mean temperature for each month.

nyfm <- queens_df %>%
  group_by(MONTH) %>%
  summarize(MT = mean(TEMP)) # Long-term mean temperature for each calendar month
nyfm

Here come some of the shortcomings of my coding skills. There is likely a better way to do this, but I could only come up with the “brute force” method. The following code creates a column of the long-term monthly mean values in this data set, which could probably be automated. I was getting frustrated trying to do that and ended up hard coding the values we calculated in the previous chunk into the data set. While certainly not the prettiest way to do it, this method was simple enough and worked as intended (a tidier join-based alternative is sketched after the code below).

nydf$monthly <- case_when(nydf$MONTH == 1 ~ nyfm$MT[1],
                          nydf$MONTH == 2 ~ nyfm$MT[2],
                          nydf$MONTH == 3 ~ nyfm$MT[3],
                          nydf$MONTH == 4 ~ nyfm$MT[4],
                          nydf$MONTH == 5 ~ nyfm$MT[5],
                          nydf$MONTH == 6 ~ nyfm$MT[6],
                          nydf$MONTH == 7 ~ nyfm$MT[7],
                          nydf$MONTH == 8 ~ nyfm$MT[8],
                          nydf$MONTH == 9 ~ nyfm$MT[9],
                          nydf$MONTH == 10 ~ nyfm$MT[10],
                          nydf$MONTH == 11 ~ nyfm$MT[11],
                          nydf$MONTH == 12 ~ nyfm$MT[12])
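
For what it’s worth, here is the tidier alternative I alluded to: instead of hard coding the twelve cases, a join attaches each month’s long-term mean directly (a sketch assuming the same column names as above):

nydf <- nydf %>%
  left_join(rename(nyfm, monthly = MT), by = "MONTH") # Attach each month's long-term mean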

Next we also want to give actual names to the months; this will allow for the creation of better graphics later on. We also have to factor the month names, which essentially means providing an order to the list. R does not automatically recognize that this list of names is the months of the year, so if you go to plot the months it will do so in alphabetical, not chronological, order.

nydf$month_name<-
  case_when(nydf$MONTH == 1 ~"January",
            nydf$MONTH == 2 ~"February",
            nydf$MONTH == 3 ~"March",
            nydf$MONTH == 4 ~"April",
            nydf$MONTH == 5 ~"May",
            nydf$MONTH == 6 ~"June",
            nydf$MONTH == 7 ~"July",
            nydf$MONTH == 8 ~"August",
            nydf$MONTH == 9 ~"September",
            nydf$MONTH == 10 ~"October",
            nydf$MONTH == 11 ~"November",
            nydf$MONTH == 12 ~"December")
nydf$month_name<- factor(nydf$month_name, levels = c("January", "February", "March", "April", "May", "June", "July", "August", "September", "October", "November", "December"))
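
As an aside, base R ships with a built-in ‘month.name’ constant, so the two steps above can be collapsed into a single line:

nydf$month_name <- factor(month.name[nydf$MONTH], levels = month.name) # Name and order the months in one go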

The next column calculates the departure from the long-term monthly mean in any given month. In other words, the mean January temperature for the whole data set is subtracted from the January mean of any given year, let’s say 2010 for argument’s sake. The result is an anomaly, or departure, of 0.24 °C. The same is done for every month of every year.

nydf$diff <- nydf$meanT - nydf$monthly # Departure of each month from its long-term mean

Now I’m going to replicate the monthly mean and anomaly code, but for each year as a whole.

ny_yr <- queens_df %>%
  filter(YEAR > 1949) %>%
  group_by(YEAR) %>%
  summarize(Temp = mean(TEMP), P = sum(PRCP))

ny_yr$mean <- mean(ny_yr$Temp) # Calculates the mean of all of the years
ny_yr$diff <- ny_yr$Temp - ny_yr$mean # Creates a column of the departure from the average calculated above

Finally, I trimmed the data to start at 1950 for the sake of simplicity. You can start this as far back as you would like.

qdf_slim <- nydf %>%
  filter(YEAR >= 1950)

Creating Graphics

The first figure I want to make is a simple line graph of the yearly mean temperatures for the whole record.

ggplot(ny_yr) + geom_line(aes(x = YEAR, y = Temp)) + 
  theme_classic()+ 
  xlab("Year") + 
  ylab("Temperature (°C)") + 
  scale_x_continuous(breaks = c(1950, 1960, 1970, 1980, 1990, 2000, 2010, 2020)) 

Right, there is certainly a great deal of inter-annual variability, but the trend is clear: average temperatures have been increasing over the last 70 years. There is room for more discussion about the drivers of that variability, much of it natural climate variability (solar cycles, ENSO, NAO, volcanic eruptions, etc.), but we would be here all day and I want to get to the neat graph.

lg_main <- ggplot(qdf_slim) + 
  geom_point(aes(x = YEAR, y = meanT, color = diff), alpha = 0.9) +
  scale_color_gradient2(midpoint = 0, high = "red", mid = "gray", low = "blue") + 
  geom_smooth(aes(x = YEAR, y = meanT), method = lm, color = "black") + 
  facet_grid(. ~ month_name) + # One panel per month
  theme_classic() + 
  xlab("Year") + 
  ylab("Temperature (°C)") + 
  scale_x_continuous(breaks = c(1950, 1985, 2022)) + 
  scale_y_continuous(breaks = c(-5, 0, 5, 10, 15, 20, 25, 30)) + 
  theme(axis.text.x = element_text(angle = 70, vjust = 0.5, hjust = 0.5)) +
  labs(colour = "Temp Anom (°C)", title = "Monthly Mean Temperatures @ LaGuardia Airport 1950-2021", caption = "Source: GSOD Stations 725180-14735 & 999999-14735")
lg_inset <- ggplot(ny_yr) + 
  geom_point(aes(x = YEAR, y = Temp)) + 
  geom_smooth(aes(x = YEAR, y = Temp), method = lm, color = "black") + 
  theme_classic() + 
  xlab("Year") + 
  ylab("Temperature (°C)") + 
  scale_x_continuous(breaks = c(1950, 1960, 1970, 1980, 1990, 2000, 2010, 2020)) 
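# The two panels still need to be combined; I am assuming the same
# inset placement used for the precipitation plot further down
lg_plot <- lg_main + inset_element(lg_inset, 0.01, 0.6, 0.3, 0.9)
lg_plot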
# Use the code below to save the file to your device; be sure to update the pathname to your desired location
#ggsave("pathname", width = 30, height = 15, units = "cm")

Brief Note on Precipitation

So the temperature graph came out quite nice; the precipitation data, however, is much sparser. While we can make the same kind of figure, it looks a bit... ehh.

lg_main_p <- ggplot(qdf_slim) + 
  geom_point(aes(x = YEAR, y = P), alpha = 0.9) +
  geom_smooth(aes(x = YEAR, y = P), method = lm, color = "black") + 
  facet_grid(. ~ month_name) + 
  theme_classic() + 
  xlab("Year") + 
  ylab("Precipitation (mm)") + 
  scale_x_continuous(breaks = c(1950, 1985, 2021)) + 
  theme(axis.text.x = element_text(angle = 70, vjust = 0.5, hjust = 0.5)) +
  labs(title = "Monthly Precipitation @ LaGuardia Airport 1950-2021", caption = "Source: GSOD Stations 725180-14735 & 999999-14735")
lg_inset_p <- ggplot(ny_yr) + 
  geom_point(aes(x = YEAR, y = P)) + 
  geom_smooth(aes(x = YEAR, y = P), method = lm, color = "black") + 
  theme_classic() + 
  xlab("Year") + 
  ylab("Precipitation (mm)") + 
  scale_x_continuous(breaks = c(1950, 1960, 1970, 1980, 1990, 2000, 2010, 2020)) 
lg_plot_p<- lg_main_p + inset_element(lg_inset_p, 0.01, 0.6, 0.3, 0.9)
lg_plot_p
## Warning: Removed 156 rows containing non-finite values (stat_smooth).
## Warning: Removed 156 rows containing missing values (geom_point).
## Warning: Removed 17 rows containing non-finite values (stat_smooth).
## Warning: Removed 17 rows containing missing values (geom_point).

By nature, precipitation is more variable than temperature. This is made worse by the fact that the reliable precipitation record at this site is half the length of the temperature record. Most trends barely exist, and those that do are not statistically significant. Brendan Wiltse had a bit more success with precipitation data in his longer data set. Recall that this graph can be made for any data set you can find, whether it’s short, long, local, regional, or global. It is mainly good for visualizing long-term monthly trends in climate parameters, but nothing will replace regression analysis and traditional time series.