Extracting Vital Statistics from the CIA World Factbook with SQL
Photo by Shlomi Platzman on Unsplash
Executive Summary
In this project, I utilized SQLite within Jupyter Notebook to extract demographic and geographic data from the CIA World Factbook. I identified China as the most populous locale tracked by the CIA, and the Pitcairn Islands as the least populous. Malawi had the highest rate of population growth, and Bulgaria had the highest rate of population contraction (both excluding immigration and emigration). Greenland had the lowest population density; Macau, the largest. Finally, Puerto Rico had the largest water-to-land-area ratio of those locales I included within the inclusion criteria.
The source code for this project can be found in this GitHub repository.
Introduction
The 'CIA World Factbook' is a datasource that provides vital statistics of 266 world 'locales'. Locales can be:
Countries
Regions
Oceans
Unoccupied Islands
In this project, I use SQLite to work with a *small subset* of the CIA factbook to identify the maximum and minimum values of:
Population
Population Growth
Population Density
Land-to-Water Ratio
Data Dictionary
In this data subset, the following columns are used:
code = a unique 2 letter locale code.
name = the name of the locale.
area = the total area of the locale in square km.
area_land = the land area of the locale in square km.
area_water = the area of sovereign water occupied by the locale in square km.
population = the total population of the locale.
population_growth = population growth rate in percent.
birth_rate = the number of births per 1,000 persons.
death_rate = the number of deaths per 1,000 persons.
migration_rate = the net migration rate per 1,000 persons. In this dataset, I am unable to determine if there is a correct sign on the migration rate, since it is not possible for all countries to have a migration rate >= 0.
Queries
Basic Statistics
To obtain some of the most basic statistical measures in this project, I performed a single basic query:
select min(population) as 'Lowest Population',
max(population) as 'Highest Population',
min(population_growth) as 'Slowest Growth',
max(population_growth) as 'Fastest Growth'
from facts
This provided the requested numerical values, but not the corresponding region. A cursory examination also revealed that the CIA factbook does not restrict its "regions" to political units. In fact, the region with the highest overall population is "The World." There were several different regions with the lowest population value of 0 including Antarctica.
Highest Population
To obtain more specific information, I performed additional queries with restrictions to avoid returning results such as "The World" or "Antarctica":
select name as 'Most Populous Locale',
max(population) as 'Population'
from facts
where population < (select max(population)
from facts)
limit 1;
This query correctly returned that China is the region with the highest population with 1,367,485,388 people.
Lowest Population
To find the region with the lowest non-zero population, I ran the following query.
select name as 'Least Populous Locale',
min(population) as 'Population'
from facts
where population > 0
limit 1;
This revealed that the Pitcairn Islands had the lowest reported population with only 48 residents.
Fastest and Slowest Population Growth
The following queries performed some quick calculations on the data to obtain information on net population growth (excluding immigration/emigration).
select name as 'Fastest Growing Locale',
round(max(birth_rate - death_rate),2) as 'Net population change per 1,000 persons'
from facts
select name as 'Slowest Growing Locale',
round(max(birth_rate - death_rate),2) as 'Net population change per 1,000 persons'
from facts
This revealed that Malawi was the fastest growing locale with a growth of 33.15 people per 1,000 population, and that Bulgaria was the slowest growing locale with a loss of 5.52 people per 1,000 population.
Population Density
I utilized some more involved calculations to identify population densities:
select name as 'Highest population density',
area_land as 'Land Area (sq km)',
population as 'Population',
round(max(cast(population as float) / area_land),2) as 'Persons per square km'
from facts
select name as 'Lowest population density',
area_land as 'Land Area (sq km)',
population as 'Population',
round(min(cast(population as float) / area_land),2) as 'Persons per square km'
from facts
where (cast(population as float) / area_land) > 0
order by (cast(population as float) / area_land)
The queries returned that Macau had the highest population density (21,169 persons per square kilometer); Greenland, the lowest (0.03 persons per square kilometer).
Water-to-land Ratio
The final queries I ran on this dataset were designed to identify the regions that had the highest water-to-land ratio in the dataset. Nominally, this value should vary from near infinity (a tiny island) to 0 (a landlocked country). In performing this calculation, I originally assumed that the Factbook presented "area_water" as the sum of Territorial Waters and Internal Waters. Generally, territorial waters extend 12 nautical miles beyond a country's coast.
A quick query revealed this was not the case:
select name as 'Locale',
area_land as 'Land Area (sq km)',
area_water as 'Water Area (sq km)'
from facts
where area_water = 0
Because in addition to landlocked countries such as Austria, this query returned many Pacific Island regions, such as the Maldives, Marshall Islands, and the Federated States of Micronesia. This suggests that the Factbook does not report the territorial waters as described above. Without a technique to accurately address this missing data, I decided to exclude locales with BOTH "area_water" = 0 and "area_land" < 1,000. These values were designed to filter out any small islands with zero reported water area.
This filtering process also excluded any truly landlocked countries from analysis.
select name as 'Locale',
area_land as 'Land Area (sq km)',
area_water as 'Water Area (sq km)',
round((cast(area_water as float) / area_land),5) as 'Water-Land Ratio'
from facts
where (round((cast(area_water as float) / area_land),5) != 0)
and (area_land > 1000)
order by round((cast(area_water as float) / area_land),5) desc
limit 5
This query returned the top five regions by water-to-land ratio:
Puerto Rico (0.555)
The Bahamas (0.386)
Guinea-Bissau (0.285)
Malawi (0.259)
The Netherlands (0.226)
Conclusions
In this project, I utilized elementary SQL queries to extract population and land-area data from the CIA World Factbook. I identified the following:
Population
China is the world's largest country by population
The Pitcairn Islands is the smallest locale (by population) tracked in the CIA Factbook.
Population Growth
Malawi has the highest rate of population growth in the world (excluding immigration and emigration)
Bulgaria has the lowest rate of population growth in the world (excluding immmigration and emigration)
Population Density
Greenland has the world's lowest population density
Macau has the world's highest population density
Puerto Rico has the world's highest water-to-land ratio.