Blog #2 – Where’s the Water?
Most of the research I do for my day job involves water. Most science has a connection to water in one way or another, because water touches our lives in so many ways, and we aim to protect it. Some examples of water related science research:
Chemistry: measuring chemicals in water
Mercury
Cadmium
Biology: measuring for biological harms
Cyanotoxins from algae
E. coli
Other human health concerns
Microplastics
Pharmaceuticals from water treatment plants
Ecology: protecting ecosystem services
Fish populations
Monitoring invasive species
There are many more; these are just the few that come to mind. When it comes to human health, ecosystem health, recreation, or simply living from day to day, water is there. Water is precious; it gives us life.
It is common to find civilizations that form and grow along a riverside, because that is where settling happens. I’ve been curious about the frequency of this; do large groups of people settle away from waterbodies? It is likely less common than settling closer to a major surface water source, but does it happen at all? Many large communities today rely on groundwater as their main potable source, so it’s not impossible.
Let’s get into the data: what do we have?
Last week we found the populated areas shapefile from Natural Earth. I should note that I use “shapefile” loosely to mean any spatial dataset made of points, lines, or polygons (strictly speaking, a shapefile is one particular file format for this kind of data). I don’t want to bore you with raster vs. vector data (that’s lesson 1 of GIS 101), so I’ll just let you know that the two types work differently, and we have been using vector data. I say shapefile a lot, and I know a lot of non-map people might not be familiar with the word, but now you know that I’m referring to a spatial dataset that has a shape and an attribute table (Figure shows examples of vector data from Wikimedia).
It is important to keep in mind that the populated areas shapefile does not include all settled areas, only populated areas that are either large or hold local significance. The dataset includes scientific stations as well as historical sites like Chernobyl, which was assigned a population of zero. The smallest Admin-1 capital population in the dataset is in El Porvenir, Panama, with a population of 10 people, so size really doesn’t matter here. However, I think that this distinction will help us in some way, since the question that we are asking is: do populated areas ever pop up far away from major surface water sources? Including culturally significant places will likely not influence this question too strongly either way, and we will always be able to review the data afterwards to make sure. If it does end up influencing the results, we could seek out a more detailed city shapefile in a more localized area.
I want to narrow down the study site to reduce the amount of surface water data I need to collect. It’s not common to find high resolution data on a global scale. That means that even if there is a global surface water dataset, it might be generalized to only show major rivers and exclude tributaries or smaller rivers. I live in Canada (hi, come visit sometime, northern Ontario is particularly lovely this time of year), so that’s a pretty easy place to start. I know that the Government of Canada makes some shapefiles publicly available, so that’s where I looked first.
What I found initially is a shapefile made of polygons: two-dimensional shapes with a measurable area that can be plotted in GIS. This wasn’t what I expected, because a dataset that includes rivers will typically be a line dataset, where each river is marked out by a line that represents the centerline of the river. This dataset combines rivers and lakes into one file, which is why they are provided as polygons. Polygon data are great for seeing how much area something covers, and we could use this to measure river and lake widths.
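To make "measurable area" concrete, here is a minimal Python sketch of how the area of a polygon can be worked out from its vertices using the shoelace formula. It assumes the coordinates are in a projected system measured in metres, and the rectangular "lake" is a made-up example, not a feature from the Government of Canada data.

```python
def polygon_area(vertices):
    """Planar area of a simple polygon using the shoelace formula.

    vertices: list of (x, y) pairs in order around the ring,
    without repeating the first vertex at the end.
    Assumes projected coordinates (for example metres), not lat/long.
    """
    total = 0.0
    n = len(vertices)
    for i in range(n):
        x1, y1 = vertices[i]
        x2, y2 = vertices[(i + 1) % n]  # wrap around to close the ring
        total += x1 * y2 - x2 * y1
    return abs(total) / 2.0


# Hypothetical lake outline: a 2 km by 1 km rectangle, in metres.
lake = [(0, 0), (2000, 0), (2000, 1000), (0, 1000)]
area_km2 = polygon_area(lake) / 1_000_000  # square metres to square kilometres
print(area_km2)
```

A line feature has no equivalent of this calculation, which is one reason wide rivers and lakes tend to be stored as polygons.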
I still want line data though… So, I googled the same title that the Government of Canada used but replaced “polygons” with “lines” and I found it! Perfect!
What was interesting is that when I overlaid them, I found that the line shapefile (green) showed more rivers than the polygon shapefile did (pink/grey). On the other hand, the line shapefile has no lake data and is missing some major rivers that are particularly wide. What I took away from this is that the line shapefile includes rivers of smaller size and no lakes, while the polygon shapefile includes major rivers which have a decent width that can be reasonably depicted with a polygon.
Now we have to make some decisions about the question we are asking. Do we want to know how many cities/populated areas are near rivers of any size? Major rivers? Do we include lakes to make the question about all kinds of surface water? Luckily, we can answer any of these questions with the data that we have. To keep things simple for now, I think I’m just going to use the line shapefile we found (otherwise this blog will go even longer than it already has).
Now we will narrow down the cities dataset to only include cities from Canada. We could do this by clipping the shapefile to the Canadian border; clipping takes all of the features within a boundary and makes them a new file. I have a polygon of Canada that includes individual polygons for each province, but I’m worried about overlapping boundaries or populated areas that sit on an island off the mainland. We could add a buffer around the Canadian polygon to capture those cities, but I don’t want to accidentally include any American cities that are close to the border. Instead, I can once again go into the attribute table of the cities data and look for cities that have their sovereign nation (labelled “sov0name”) written as “Canada”. I used the ‘Select by Attributes’ tool once again to find where “sov0name” = “Canada”.
When you click on “sov0name”, you can click on “Get Unique Values” to then select the attribute that you want. Now we’ve got all of the Canadian cities selected:
We can export this data, but first I want to double check all of the borders to see if there were any stragglers, or any cities not included due to a clerical error.
In fact, I found two cities that were not included: Sault Ste. Marie (pronounced “Soo Saint Marie”), and Niagara Falls. The third dot that seems close is Buffalo, which is an American city.
Both of these cities are on the Canadian-American border, and when I looked into the metadata (using the inspect tool), I found that they were both listed as American cities. Upon further googling, I realized that both Canada and the US have cities with these names in the same location, so you can simply cross over from the Canadian side to the American side. I can confirm that these cities are in Canada, as I’ve been to them both!
Niagara Falls - The falls, and a rainbow! Sault Ste. Marie - The bears
Interesting that the cities are defaulted to the US in the dataset…
Now then, I manually selected these cities in the attribute table to be included and exported them all into a new dataset.
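For anyone who prefers scripting to clicking, the attribute selection and the manual additions described above can be sketched in plain Python. This is only an illustration: the rows are a tiny made-up stand-in for the attribute table, the "United States" values are placeholders for however the dataset labels American cities, and matching on name alone is a simplification (in the real table I picked the two specific points by hand).

```python
# Hypothetical stand-in for the populated places attribute table.
rows = [
    {"name": "Toronto", "sov0name": "Canada"},
    {"name": "Sault Ste. Marie", "sov0name": "United States"},
    {"name": "Buffalo", "sov0name": "United States"},
    {"name": "Niagara Falls", "sov0name": "United States"},
    {"name": "Hinton", "sov0name": "Canada"},
]


def select_by_attribute(table, field, value):
    """Return the rows where `field` equals `value`."""
    return [row for row in table if row.get(field) == value]


def add_by_name(selection, table, names):
    """Return the selection plus any rows whose name is in `names`."""
    chosen = list(selection)
    for row in table:
        if row["name"] in names and row not in chosen:
            chosen.append(row)
    return chosen


# Step 1: the equivalent of "sov0name" = "Canada".
canadian = select_by_attribute(rows, "sov0name", "Canada")

# Step 2: manually add the two border cities that were labelled as American.
canadian = add_by_name(canadian, rows, {"Sault Ste. Marie", "Niagara Falls"})

selected_names = [row["name"] for row in canadian]
print(selected_names)
```

Buffalo stays out of the selection because it is neither labelled as Canadian nor in the manual list.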
Let’s add the rivers!
At first glance, we can see that there are a lot of cities near waterbodies. But how close does a city need to be to be counted as being on a waterbody? Also, looking at this, I’ve realized that I need to take the oceans and Hudson Bay into account… Let’s do that next week.
For now, the more specific question we are going to answer is: how many populated areas are located on or around a river or tributary of smaller size within Canada? The next step is to define “around”: how far can a city be from a river and still be considered next to it? To account for the fact that each point likely represents the center point of a city, let’s say that a city counts as being close to a river if the distance to the nearest river is 20 km or less. We could also try 10 km and 30 km to see if that makes a major difference in either direction, but for now we will stick to 20 km. A larger radius will have a bigger impact on smaller cities, but this is a good place to start.
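One way to see how much the choice of radius matters is to count how many places fall inside each candidate distance. The Python sketch below assumes the distance from each place to its nearest river has already been measured; the place names and distances are invented purely for illustration and are not results from the real data.

```python
# Hypothetical nearest-river distances in kilometres (not real results).
distances_km = {
    "Place A": 3.2,
    "Place B": 14.8,
    "Place C": 22.5,
    "Place D": 41.0,
}


def count_within(distances, radius_km):
    """Count the places whose nearest river is at or within radius_km."""
    return sum(1 for d in distances.values() if d <= radius_km)


# Compare the three radii mentioned above.
counts = {radius: count_within(distances_km, radius) for radius in (10, 20, 30)}
for radius, count in counts.items():
    print(f"{radius} km: {count} of {len(distances_km)} places")
```

If the counts barely change between 10, 20, and 30 km, the exact threshold is not doing much work; if they jump, it deserves a closer look.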
How do we test proximity? We could try making buffers using the Buffer tool, which generates a polygon around each river that has a set radius/distance from the center line. However, when I tried using this tool, I was getting the 999999 error (which is the worst, because it means that the computer doesn’t know what the problem is…). I believe there might be something iffy with the data. I tried doing the buffer several different ways, but it was taking 5+ hours to process and then would come back with the error. GIS does this sometimes; it doesn't always go smoothly. So, I literally googled "proximity test GIS" and found that I could use the Near tool instead, which measures the distance from each point to the nearest feature in a different shapefile. There is a huge online GIS community, so learning how to phrase your search is a very valuable skill for GIS users. I opened the Near tool, set the limit to 20 km, and any points outside of that radius were marked with a -1 in the attribute table.
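To give a feel for what a nearest-feature test like this is doing, here is a rough Python sketch. It is not how the Near tool is actually implemented; it simply measures the straight-line distance from a point to every segment of every river line and keeps the smallest, returning -1 when nothing falls inside the search radius. It assumes projected coordinates in metres, and the river and places are made up.

```python
import math


def point_segment_distance(p, a, b):
    """Shortest planar distance from point p to the segment a-b."""
    px, py = p
    ax, ay = a
    bx, by = b
    dx, dy = bx - ax, by - ay
    seg_len_sq = dx * dx + dy * dy
    if seg_len_sq == 0:  # degenerate segment: a and b are the same point
        return math.hypot(px - ax, py - ay)
    # Position of the closest point along the segment, clamped to its ends.
    t = ((px - ax) * dx + (py - ay) * dy) / seg_len_sq
    t = max(0.0, min(1.0, t))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))


def near(point, lines, search_radius):
    """Distance to the nearest line, or -1 if none is within search_radius."""
    best = min(
        (
            point_segment_distance(point, line[i], line[i + 1])
            for line in lines
            for i in range(len(line) - 1)
        ),
        default=math.inf,
    )
    return best if best <= search_radius else -1


# Hypothetical data: one straight 100 km river and two places (metres).
river = [(0, 0), (100_000, 0)]
places = {"Place A": (50_000, 5_000), "Place B": (50_000, 30_000)}

results = {name: near(pt, [river], 20_000) for name, pt in places.items()}
far_places = [name for name, dist in results.items() if dist == -1]
print(results)
print(far_places)
```

Because only distances are compared, no buffer polygons ever have to be built.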
I highlighted all the points assigned the -1 value and found that there are 51 populated areas further than 20 km away from a river, at least from the rivers included in the line shapefile. Notably, a lot of the highlighted cities are close to a different type of surface water, like the coast or a lake. We can take a closer look at that next week.
I zoomed in on one city in Alberta, Hinton (circled above), to test the results, and sure enough, the closest river is further than 20 km away.
But when we add the surface water shapefile back in, we see that Hinton is sitting right on a river! So next time, we will take a look at how to use both the line and the polygon shapefiles to run this test again.