
Extending Data with APIs
Overview
In this post I will cover how to extend an existing dataset using publicly available APIs. We might want to do this for a number of reasons, such as engineering new features for a machine learning model or minimizing the amount of data we collect and store.
I. Design of Data
For this exercise I created two sample datasets. The first contains fishing locations that are popular in my area. The second simulates a personal catch log recording what type of fish was caught and where we caught it. Because most people don't like to mess around with more technology while they are fishing, I wanted to keep the table dimensions small and leave the rest of the information (things like weather, tide, etc.) for the program to collect. Below are example tables showing what both datasets will look like.
Personal Catch Table
This is what the data the user inputs will look like. Again, I don't want the program to rely on too much manual input; these four fields should be all you need to keep track of, because the software should handle figuring out the weather data and the rest.
Species | Latitude | Longitude | Datetime |
---|---|---|---|
Striped Bass | 41.038999 | -71.916319 | Wed, 6 Jun 2018 17:18:56 -0500 |
Fluke | 41.012193 | -72.063143 | Sat, 21 Jul 2018 00:18:56 -0500 |
Fishing Locations Table
This is what the second dataset will look like: a table of popular fishing locations. The Target column specifies which types of fish can be found at each spot.
Location | Latitude | Longitude | Target |
---|---|---|---|
Montauk Point (The Elbow) | 41.070783 | -71.854206 | Striped Bass, Black Seabass, Bluefish |
Blackfish Rock | 41.079974 | -71.882730 | Striped Bass, Blackfish |
Ditch Plains | 41.038796 | -71.916757 | Striped Bass |
Fort Tyler (The Ruins) | 41.141555 | -72.145713 | Striped Bass, Fluke, Bluefish |
Plum Gut | 41.165869 | -72.215735 | Striped Bass |
Navy Beach Fishing Pier | 41.044804 | -71.977920 | Striped Bass, Black Seabass, Bluefish |
Georgica Beach (1st Jetty) | 40.933467 | -72.216431 | Striped Bass |
Georgica Beach (White House) | 40.939470 | -72.201378 | Striped Bass |
Green Lawns | 41.066667 | -72.383333 | Fluke |
II. Create Some Sample Data
Later on we will have much more data and use a database to store it. But because I want to focus on the best ways to use Python rather than on data entry, I just want a proof of concept, so I will start by populating some pandas DataFrames with sample data like the tables above. I considered typing the data into a DataFrame manually, but since I had already spent time formatting the tables above in markdown, I figured I might as well use Excel to convert them into small sample_*.csv files.
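For anyone following along without Excel, the same sample files can be written directly from Python with the standard library. This is just a sketch: the filenames and the subset of rows here are my own choices, and any sample_*.csv files in data/ will do.

```python
import csv
import os

# Make sure the data folder the notebook reads from exists
os.makedirs('data', exist_ok=True)

# A couple of rows from the locations table above
with open('data/sample_locations.csv', 'w', newline='') as f:
    w = csv.writer(f)
    w.writerow(['Location', 'Latitude', 'Longitude', 'Target'])
    w.writerow(['Ditch Plains', 41.038796, -71.916757, 'Striped Bass'])
    w.writerow(['Plum Gut', 41.165869, -72.215735, 'Striped Bass'])

# The personal catch log rows
with open('data/sample_personal.csv', 'w', newline='') as f:
    w = csv.writer(f)
    w.writerow(['Species', 'Latitude', 'Longitude', 'Datetime'])
    w.writerow(['Striped Bass', 41.038999, -71.916319,
                'Wed, 6 Jun 2018 17:18:56 -0500'])
    w.writerow(['Fluke', 41.012193, -72.063143,
                'Sat, 21 Jul 2018 00:18:56 -0500'])
```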
```python
import os
import pandas as pd

# Get a list of data files in the data folder
dir_list = os.listdir("data/")
# .DS_Store is a hidden file found in directories on macOS
if '.DS_Store' in dir_list:
    dir_list.remove(".DS_Store")
# Sort so the file order is deterministic
dir_list.sort()
# Import locations csv
locationsDF = pd.read_csv('data/' + dir_list[0], index_col=False,
                          header=0, error_bad_lines=False)
# Import personal csv
personalDF = pd.read_csv('data/' + dir_list[1], index_col=False,
                         header=0, error_bad_lines=False)
# Make sure data is good
locationsDF.head(2)
personalDF.head(2)
```
Extending the Data with APIs
For this part I want to take advantage of our tax dollars and use the NOAA API, the Weather.gov API, and others to retrieve extremely detailed and accurate weather and tide data. Because we have a date/time and a location (latitude and longitude), we can easily derive other data points from these few fields. This is great because we don't have to worry about maintaining an entire database of tides, weather, and other information that entire organizations are dedicated to collecting.
I. Weather.gov API
For this part I will be working with the NOAA and Weather.gov APIs. There are two primary steps: first, convert the data in the DataFrame to a format accepted by the API, so that we can completely automate looking up the extra information once the fisherman uploads their data; second, actually query the API and append the returned data to the DataFrame with proper formatting. I want to stick with the fishing locations table rather than the personal catches, because the personal catches will require a bit more work to make functional: they involve looking up historical information, for which finding an API can be difficult.
Use API to Add Today's Temperature to Dataframe
As a first exercise to practice using the new ‘requests’ library and parsing the JSON output, I simply want to request the current temperature from weather.gov at each of the locations.
```python
import json
import requests

tempr_list = []
# Loop through the dataframe
for index, row in locationsDF.iterrows():
    # Define api request parameters in the payload
    payload = str(row['Latitude'])[:9] + ',' + str(row['Longitude'])[:10]
    # Define the URL, which in our case is very simple
    url = "https://api.weather.gov/points/{0}/forecast".format(payload)
    data = requests.get(url)
    json_data = json.loads(data.text)
    if 'properties' not in json_data:
        result = "API Missing Data"
    else:
        result = json_data['properties']['periods'][0]['temperature']
    # Append the temperature result to the temp list
    tempr_list.append(result)
# Append the list as a column to the dataframe
locationsDF['Todays Temperature (F)'] = tempr_list
```
New Dataframe with Temperatures
Very cool! Based on the dataframe below, it looks like the API failed to return data for some locations. I will have to look into this further; I suspect there is an issue with how the coordinates are being sent to the API, as I doubt it really has no data for those areas. For now I will push forward with the proof of concept and either choose a different API for the final project or, hopefully, fix the issue. We have added a column titled "Todays Temperature (F)" for temperature in Fahrenheit.
```python
locationsDF.head(10)
```
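One likely culprit, and this is my own suspicion rather than something confirmed against the API docs, is the string slicing used to build the coordinate pair: api.weather.gov works with coordinates rounded to about four decimal places, and slicing like str(lat)[:9] can cut a negative longitude at an odd spot. A small helper that rounds instead of slicing might look like this:

```python
def format_point(lat, lon, precision=4):
    """Build the 'lat,lon' string for api.weather.gov by rounding,
    not string-slicing, so negative longitudes are never truncated
    mid-digit and we send at most `precision` decimal places."""
    return "{0},{1}".format(round(lat, precision), round(lon, precision))

# e.g. the Ditch Plains coordinates from the locations table
point = format_point(41.038796, -71.916757)
url = "https://api.weather.gov/points/{0}/forecast".format(point)
```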
Use API to Add Other Important Data to Dataframe
Temperature is great, but there are many other data points a very experienced fisherman knows to look at before choosing where to fish. Let's use some more API calls to NOAA or Weather.gov to get some of those data points. This time I also print a message inside the loop so we can see which locations are returning empty data from the API.
```python
windspd_list = []
winddirec_list = []
fcast_list = []
# Loop through the dataframe
for index, row in locationsDF.iterrows():
    # Payload
    payload = str(row['Latitude'])[:9] + ',' + str(row['Longitude'])[:10]
    # Format url with desired payload (parameters)
    url = "https://api.weather.gov/points/{0}/forecast".format(payload)
    # Use requests library to get url
    data = requests.get(url)
    # Load results as json data
    json_data = json.loads(data.text)
    # Process results
    if 'properties' not in json_data:
        result_wspeed = result_wdirec = result_forcast = "API Missing Data"
        print("Lat:", str(row['Latitude'])[:7],
              "Lon:", str(row['Longitude'])[:8], "ERROR COLLECTING DATA")
    else:
        result_wspeed = json_data['properties']['periods'][0]['windSpeed']
        result_wdirec = json_data['properties']['periods'][0]['windDirection']
        result_forcast = json_data['properties']['periods'][0]['shortForecast']
    # Append results to the lists
    windspd_list.append(result_wspeed)
    winddirec_list.append(result_wdirec)
    fcast_list.append(result_forcast)
# Append the lists as columns to the dataframe
locationsDF['Todays Windspeed'] = windspd_list
locationsDF['Todays Wind Direction'] = winddirec_list
locationsDF['Todays Summary Forecast'] = fcast_list
```
New Dataframe with More Data
Now the dataset is starting to come together! This particular table could be really useful as it is: we could build a great dashboard of current conditions at the fishing grounds before we head out in the boat or car. That is a huge improvement over tuning to a channel on your VHF radio and listening to the report! We also now have information we could use to build some predictive models if we so desired. Again, we will have to fix the missing-data problems, but I think they can be corrected with a few hours of debugging.
```python
locationsDF.head(5)
```
II. NOAA API
Now we will work with the NOAA API. Compared to the Weather.gov API, the NOAA API provides oceanic data that is more relevant to fishing, like water temperature, barometric pressure, etc. These indicators tend to drive the behavior of fish and are critical for success. The only caveat with this part is that NOAA has only a handful of CO-OPS stations collecting this data, and data from these stations must be queried by station id instead of latitude and longitude.
Use API to Add Water Temp to Dataframe
As a first exercise to practice using the NOAA API, the ‘requests’ library, and parsing the JSON output, I simply want to request the water temperature at each location and time of catch. I will eventually need a solution for mapping these stations to lat/lon, or the software will be unusable outside the eastern Long Island area. I suspect the largest challenge in this part will be formatting the dates.
```python
from datetime import datetime

# Station ID nearest to the fishing grounds
station_id = '8510560'
# Water temp list
water_temps = []
# URL for the API
url = 'http://tidesandcurrents.noaa.gov/api/datagetter'
# Loop through the dataframe
for index, row in personalDF.iterrows():
    # Parse the RFC-2822-style datetime string into a datetime object
    datetime_object = datetime.strptime(row['Datetime'],
                                        '%a, %d %b %Y %H:%M:%S %z')
    date_to_send = datetime_object.strftime('%Y%m%d')
    # Assemble payload
    payload = {'station': station_id, 'end_date': date_to_send,
               'range': '24', 'product': 'water_temperature',
               'format': 'json', 'units': 'english', 'time_zone': 'lst'}
    data = requests.get(url, params=payload)
    json_data = json.loads(data.text)
    water_temps.append(json_data['data'][0]['v'])
personalDF['Water Temp. (F)'] = water_temps
```
New Personal Catches Dataframe with Some New Data
Now our personal catch dataset is starting to come together! This is a great example of using historical data to look up the water conditions. We looked up temperature in Fahrenheit, but there are many other parameters, including tides, that we could find. I will definitely be using this API in my final project and try to get some more data from it.
```python
personalDF.head(5)
```
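As a taste of those other parameters, the same CO-OPS datagetter endpoint also serves tide predictions; swapping the product and supplying a tidal datum is roughly all it takes. A sketch under that assumption, reusing the station and parameter style from the water-temperature query above:

```python
def tide_payload(station_id, date_yyyymmdd):
    """Build a CO-OPS datagetter payload for tide predictions.

    Same endpoint as the water-temperature query, but the
    'predictions' product additionally requires a tidal datum
    such as MLLW.
    """
    return {'station': station_id, 'begin_date': date_yyyymmdd,
            'range': '24', 'product': 'predictions', 'datum': 'MLLW',
            'interval': 'hilo',  # just the high/low tide events
            'units': 'english', 'time_zone': 'lst', 'format': 'json'}

payload = tide_payload('8510560', '20180606')
# data = requests.get('http://tidesandcurrents.noaa.gov/api/datagetter',
#                     params=payload)
# json.loads(data.text)['predictions'] would then hold the tide events
```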
Conclusion
Now we have a great foundation for deriving a ton of data from just one or two columns! This alone is great because I no longer have to look up all this information manually. There is also virtually no loss in quality, because most websites that provide things like tides and weather use NOAA data anyway. For my final project I hope to use the methods I developed here to build a nice intel dashboard with plenty of data available to me before and after my fishing trips.