Extracting Vital Statistics from the CIA World Factbook with SQL

Photo by Shlomi Platzman on Unsplash  

Executive Summary

In this project, I utilized SQLite within Jupyter Notebook to extract demographic and geographic data from the CIA World Factbook. I identified China as the most populous locale tracked by the CIA, and the Pitcairn Islands as the least populous. Malawi had the highest rate of population growth, and Bulgaria had the highest rate of population contraction (both excluding immigration and emigration). Greenland had the lowest population density; Macau, the largest. Finally, Puerto Rico had the largest water-to-land-area ratio of those locales I included within the inclusion criteria.

The source code for this project can be found in this GitHub repository.

Introduction

The 'CIA World Factbook' is a datasource that provides vital statistics of 266 world 'locales'. Locales can be:

In this project, I use SQLite to work with a *small subset* of the CIA factbook to identify the maximum and minimum values of:

Data Dictionary

In this data subset, the following columns are used:

Queries

Basic Statistics

To obtain some of the most basic statistical measures in this project, I performed a single basic query:

select min(population) as 'Lowest Population',

       max(population) as 'Highest Population',

       min(population_growth) as 'Slowest Growth',

       max(population_growth) as 'Fastest Growth'

  from facts

This provided the requested numerical values, but not the corresponding region. A cursory examination also revealed that the CIA factbook does not restrict its "regions" to political units. In fact, the region with the highest overall population is "The World." There were several different regions with the lowest population value of 0 including Antarctica.

Highest Population

To obtain more specific information, I performed additional queries with restrictions to avoid returning results such as "The World" or "Antarctica":

select name as 'Most Populous Locale',

       max(population) as 'Population'

  from facts

 where population < (select max(population)

                       from facts)

 limit 1;

This query correctly returned that China is the region with the highest population with 1,367,485,388 people.

Lowest Population

To find the region with the lowest non-zero population, I ran the following query.

select name as 'Least Populous Locale',

       min(population) as 'Population'

  from facts

 where population > 0

 limit 1;

This revealed that the Pitcairn Islands had the lowest reported population with only 48 residents.

Fastest and Slowest Population Growth

The following queries performed some quick calculations on the data to obtain information on net population growth (excluding immigration/emigration).


select name as 'Fastest Growing Locale',

       round(max(birth_rate - death_rate),2) as 'Net population change per 1,000 persons'

  from facts


select name as 'Slowest Growing Locale',

       round(max(birth_rate - death_rate),2) as 'Net population change per 1,000 persons'

  from facts

This revealed that Malawi was the fastest growing locale with a growth of 33.15 people per 1,000 population, and that Bulgaria was the slowest growing locale with a loss of 5.52 people per 1,000 population.

Population Density

I utilized some more involved calculations to identify population densities:


select name as 'Highest population density',

       area_land as 'Land Area (sq km)',

       population as 'Population',

       round(max(cast(population as float) / area_land),2) as 'Persons per square km'

  from facts


select name as 'Lowest population density',

       area_land as 'Land Area (sq km)',

       population as 'Population',

       round(min(cast(population as float) / area_land),2) as 'Persons per square km'

  from facts

 where (cast(population as float) / area_land) > 0

 order by (cast(population as float) / area_land)

The queries returned that Macau had the highest population density (21,169 persons per square kilometer); Greenland, the lowest (0.03 persons per square kilometer).

Water-to-land Ratio

The final queries I ran on this dataset were designed to identify the regions that had the highest water-to-land ratio in the dataset. Nominally, this value should vary from near infinity (a tiny island) to 0 (a landlocked country). In performing this calculation, I originally assumed that the Factbook presented "area_water" as the sum of Territorial Waters and Internal Waters. Generally, territorial waters extend 12 nautical miles beyond a country's coast.

A quick query revealed this was not the case:


select name as 'Locale',

       area_land as 'Land Area (sq km)',

       area_water as 'Water Area (sq km)'

  from facts

 where area_water = 0

Because in addition to landlocked countries such as Austria, this query returned many Pacific Island regions, such as the Maldives, Marshall Islands, and the Federated States of Micronesia. This suggests that the Factbook does not report the territorial waters as described above. Without a technique to accurately address this missing data, I decided to exclude locales with BOTH "area_water" = 0 and "area_land" < 1,000. These values were designed to filter out any small islands with zero reported water area.

This filtering process also excluded any truly landlocked countries from analysis.


select name as 'Locale',

       area_land as 'Land Area (sq km)',

       area_water as 'Water Area (sq km)',

       round((cast(area_water as float) / area_land),5) as 'Water-Land Ratio'

  from facts

 where (round((cast(area_water as float) / area_land),5) != 0)

       and (area_land > 1000)

 order by round((cast(area_water as float) / area_land),5) desc

 limit 5

This query returned the top five regions by water-to-land ratio:

Conclusions

In this project, I utilized elementary SQL queries to extract population and land-area data from the CIA World Factbook. I identified the following: