Mark Di Marco
This post will cover creating an interactive, 3D geographical data visualization, with some raw data, 30 lines of basic Python, 35 lines of Javascript, and a few helpful bash commands. Source code on Github can be found here, and a live demo can be viewed here.
The data we'll use comes from the THOR dataset published by data.mil, the U.S. Defense Digital Service, available for download on data.world.
The data is in CSV format, containing 4.6 million rows and 47 columns. It covers air missions flown by coalition forces from 1965 to 1975 (U.S. military operations in Vietnam largely ended in 1973).
The 4.6 million rows of data available for Vietnam contains all types of air missions - including bombings, recon, psy-ops, troop deployments, and supply drops. Our visualization is meant to cover bombings in particular, so we'll have to remove some data to isolate these mission types. The following is a sampling of the THOR data:
thor_data_viet_id | countryflyingmission | milservice | msndate | sourceid | sourcerecord | valid_aircraft_root | takeofflocation | tgtlatdd_ddd_wgs84 | tgtlonddd_ddd_wgs84 | tgttype | numweaponsdelivered | timeontarget | weapontype | weapontypeclass | weapontypeweight | aircraft_original | aircraft_root | airforcegroup | airforcesqdn | callsign | flthours | mfunc | mfunc_desc | missionid | numofacft | operationsupported | periodofday | unit | tgtcloudcover | tgtcontrol | tgtcountry | tgtid | tgtorigcoords | tgtorigcoordsformat | tgtweather | additionalinfo | geozone | id | mfunc_desc_class | numweaponsjettisoned | numweaponsreturned | releasealtitude | releasefltspeed | resultsbda | timeofftarget | weaponsloadedweight | year |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
404557 | UNITED STATES OF AMERICA | USN | 1968-06-21 | 4757386 | CACTA6669 | A-7 | WESTPAC 66 | 18.38 | 105.744 | TRUCK PARK\STOP | 0 | 120.0 | 500LB GP MK-82 | 500 | A7A | A7 | 18 | 1.0 | STRIKE | A0002 | 1 | ROLLING THUN | D | N09122 | PREBRIEFED TGT | NORTH VIETNAM | 18380N105744E | DD.DDDN DDD.DDDE | 1309H | MISSION: RT57C - CONTROL_TYPE: PREBRIEFED TGT - RTCOR: CORPS AREA II (IN-COUNTRY MISSIONS) OR RTE AREA 2 (OUT-OF-COUNTRY MISSIONS) | WF | 26299335 | KINETIC | 900 | 0 | 120.0 | -1 | 1968 | |||||||||
404616 | UNITED STATES OF AMERICA | USN | 1968-10-11 | 4757606 | CACTA6669 | F-4 | WESTPAC 43 | 18.466666 | 105.766666 | AREA\DEPOT | 0 | 0.0 | SIDEWINDER | 164 | F4B | F4 | 17 | 35.0 | ESCORT/COVER | L0035 | 1 | D | N09558 | PREBRIEFED TGT | NORTH VIETNAM | 1828N10546E | DDMMN DDDMME | MISSION: UERT614B - CONTROL_TYPE: PREBRIEFED TGT - RTCOR: CORPS AREA II (IN-COUNTRY MISSIONS) OR RTE AREA 2 (OUT-OF-COUNTRY MISSIONS) | WF | 26299395 | NONKINETIC | 0 | 4 | 0.0 | -1 | 1968 | |||||||||||
404560 | UNITED STATES OF AMERICA | USN | 1968-04-09 | 4757394 | CACTA6669 | A-4 | WESTPAC 31 | 16.675 | 106.667 | TROOPS | 152 | 210.0 | 2.75IN HE RKT | 18 | A4F | A4 | 76 | 6.0 | DIRECT AIR SUPPORT | B0008 | 4 | IN COUNTRY | D | N09933 | ABN FWD AIR CNTL | SOUTH VIETNAM | 16675N106667E | DD.DDDN DDD.DDDE | 9996H | MISSION: VS - CONTROL_TYPE: ABN FWD AIR CNTL - RTCOR: CORPS AREA I (IN-COUNTRY MISSIONS) OR RTE AREA 1 (OUT-OF-COUNTRY MISSIONS) | XC | 26299338 | KINETIC | 0 | 0 | 220.0 | -1 | 1968 | |||||||||
404564 | UNITED STATES OF AMERICA | USN | 1968-09-07 | 4757409 | CACTA6669 | A-7 | WESTPAC 66 | 18.977777 | 105.447222 | RADAR | 0 | 1455.0 | MK-11/MK-12 | 57 | A7A | A7 | 0 | 1.0 | STRIKE | A0031 | 1 | ROLLING THUN | N | N09943 | TGT OF OPPORTUNITY | NORTH VIETNAM | 185840N1052650E | DDMMSSN DDDMMSSE | 9995H | MISSION: RTIH57C - CONTROL_TYPE: TGT OF OPPORTUNITY - RTCOR: CORPS AREA III (IN-COUNTRY MISSIONS) OR RTE AREA 3 (OUT-OF-COUNTRY MISSIONS) | WF | 26299342 | KINETIC | 0 | 10 | 1456.0 | -1 | 1968 | |||||||||
404565 | UNITED STATES OF AMERICA | USAF | 1968-08-10 | 4757412 | CACTA6669 | A-37 | BIEN HOA | 10.1113705 | 105.9828826 | CAMP | 0 | 215.0 | 7.62MM | 6 | A37A | A37 | 22 | 3.0 | AIR INTERDICTION | JA009 | 2 | IN COUNTRY - SOUTH VIETNAM | D | JAPNU0 | ABN FWD AIR CNTL | SOUTH VIETNAM | 48PXS08001780 | ##ABC11112222 | 9999N | MISSION: VS08-5840 - CONTROL_TYPE: ABN FWD AIR CNTL - RTCOR: CORPS AREA IV (IN-COUNTRY MISSIONS) OR RTE AREA 4 (OUT-OF-COUNTRY MISSIONS) | XS | 26299343 | KINETIC | 0 | 24 | 229.0 | -1 | 1968 | |||||||||
404572 | UNITED STATES OF AMERICA | USAF | 1968-12-14 | 4757454 | CACTA6669 | F-4 | UBON AB | 16.771 | 106.117 | SEGMENT\TRANS ROUTE | 12 | 845.0 | 0 | F4D | F4 | 22 | 1.0 | STRIKE | JV006 | 2 | STEEL TIGER - BALDIE | D | JALJI0 | ABN FWD AIR CNTL | LAOS | 16771N106117E | DD.DDDN DDD.DDDE | MISSION: SL349 - CONTROL_TYPE: ABN FWD AIR CNTL - RTCOR: RTE AREA F (OUT-OF-COUNTRY MISSIONS) | XC | 26299350 | KINETIC | 0 | 0 | 846.0 | -1 | 1968 |
There are a handful of columns we can use to infer which rows to keep and which to filter out. numweaponsdelivered looked interesting up front, but it has non-zero values for all different types of missions, including recon and troop deployments. mfunc_desc has definitive values, like STRIKE and AIR INTERDICTION, but digging through them revealed a high cardinality of distinct values, and our filtering method would have to maintain that whole list. I ended up using a combination of mfunc_desc_class (whose only distinct values are KINETIC and NONKINETIC), numweaponsdelivered when the value was greater than 0, and a check to exclude rows without a target latitude and longitude in the tgtlatdd_ddd_wgs84 and tgtlonddd_ddd_wgs84 columns.
1AWK is a text-processing command-line program that's been around since 1977; more information is available at the AWK Wiki.
A fresh install of MacOS or Linux comes well equipped with the commands we'll need to filter our data, namely the awk1 program.
With the data downloaded to my desktop, we'll filter down the 1.6 gigabyte CSV file with a single command.
awk -F, '$9 != "" && $40 != "NONKINETIC" && $12 != 0 {print}' thor_data_vietnam.csv > thor_filtered.csv
The above command says: run awk, set the column separator to a comma, and if column 9 (tgtlatdd_ddd_wgs84) is not empty, column 40 (mfunc_desc_class) is not NONKINETIC, and column 12 (numweaponsdelivered) is not zero, print the line; the output is redirected to a file named thor_filtered.csv.
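One caveat with awk's -F, is that it splits on every comma, so a quoted field containing a comma would throw the column numbering off. As a sketch of a more defensive alternative (not the approach used in this post), the same filter can be written with Python's csv module, which respects quoting; the 0-based indices below correspond to the 1-based columns 9, 40, and 12 from the awk command:

```python
import csv

def keep(row):
    """Mirror the awk filter: target latitude present (column 9), kinetic
    mission (column 40), and at least one weapon delivered (column 12)."""
    return (row[8] != ""                   # tgtlatdd_ddd_wgs84 not empty
            and row[39] != "NONKINETIC"    # mfunc_desc_class
            and row[11] not in ("", "0"))  # numweaponsdelivered != 0

def filter_csv(src, dst):
    """Copy src to dst, keeping the header and only rows that pass keep()."""
    with open(src, newline="") as fin, open(dst, "w", newline="") as fout:
        reader, writer = csv.reader(fin), csv.writer(fout)
        writer.writerow(next(reader))  # keep the header row
        writer.writerows(row for row in reader if keep(row))
```

Calling filter_csv("thor_data_vietnam.csv", "thor_filtered.csv") would then produce roughly the same subset as the awk one-liner, modulo quoting edge cases.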
2wc -l
is the standard word count program UNIX-based systems ship with. The -l
option asks for a count of lines.
Running wc -l thor_filtered.csv
2 shows we've filtered the 4.6 million rows down to 1.95 million rows, and the 1.6 gigabyte file is now 735 megabytes.
3For posterity, gzip -c thor_filtered.csv | wc -c
shows 133 megabytes after gzipping.
The goal for this project is to be an interactive browser-based visualization, so shipping 735 megabytes of data is out of the question, even if gzipping
the file pares down the size3.
We need a way to aggregate this data before sending it to the browser. We know these 1.95 million rows all happen in a relatively small geographical area, so ideally we could find a way to cluster similar entries based on GPS coordinates. Since we'll be showing a hexgrid visualization, each hex in the visualization might encompass many square kilometers.
There exists a library called H3 that is perfect for our use case. H3 is a "Hexagonal Hierarchical Geospatial Indexing System" that provides an API which takes a latitude, longitude, and resolution, and returns a string representing the bucket for that entry. Increasing the resolution parameter increases the number of buckets.
4H3's bucketing algorithm is understandably lossy. Running the reverse command, h3.h3_to_geo, won't give you back the exact coordinates you put in, but as you increase the resolution, the results get closer and closer.
Using H3's Python binding (pip install h3), we can test it: h3.geo_to_h3(0, 0, 5) outputs '85754e67fffffff', and h3.geo_to_h3(14, -100, 9) outputs '896da150217fffff'.4
Now we can write a simple Python program that reads the 1.95-million-row CSV file one row at a time, computes the H3 bucket for each row, aggregates the counts in a simple Python dictionary, and then serializes that dictionary out to JSON for easy browser consumption (browsers can parse JSON natively, but require third-party libraries to parse CSV).
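The aggregation step can be sketched roughly as below. This is not the actual make_h3.py from the repo: the bucketing function is passed in as a callable so the counting logic stands on its own, the resolution of 5 is a guess, and the 0-based coordinate indices (8 and 9) assume the filtered file keeps the original column order.

```python
import csv
import json
from collections import Counter

def aggregate(rows, to_bucket):
    """Count entries per bucket, then drop single-entry buckets to cut noise."""
    counts = Counter(to_bucket(lat, lon) for lat, lon in rows)
    return [{"hex": h, "count": n} for h, n in counts.items() if n > 1]

def read_coords(path):
    """Yield (lat, lon) pairs from the filtered CSV (target coordinates)."""
    with open(path, newline="") as f:
        reader = csv.reader(f)
        next(reader)  # skip the header row
        for row in reader:
            yield float(row[8]), float(row[9])

if __name__ == "__main__":
    import h3  # pip install h3
    # Resolution 5 is an assumption; higher values mean smaller hexes.
    buckets = aggregate(read_coords("thor_filtered.csv"),
                        lambda lat, lon: h3.geo_to_h3(lat, lon, 5))
    print(json.dumps(buckets))
```

Keeping the bucketing function injectable also makes the counting logic easy to test without h3 installed.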
To create a JSON file based on this program, all you have to do is run python make_h3.py > hexed.json. The resulting file is 2.8 megabytes, with 53,330 distinct buckets after filtering out all buckets with only one entry to reduce some of the noise.
Each entry in the JSON array has a hex key, which encodes the coordinates and bucket, and a count key, which is the number of rows our Python script placed into that bucket.
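Concretely, the JSON looks something like this (the two entries below are made up for illustration; the hex strings are just the sample H3 outputs from earlier):

```python
import json

# Hypothetical two-entry slice of hexed.json, in the shape described above.
sample = ('[{"hex": "85754e67fffffff", "count": 12},'
          ' {"hex": "896da150217fffff", "count": 3}]')

entries = json.loads(sample)
assert all({"hex", "count"} <= entry.keys() for entry in entries)
total = sum(entry["count"] for entry in entries)  # total rows across buckets
```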
Now that we have the 1.6 gigabytes of data filtered and aggregated down to 2.8 megabytes, we're ready to create a visualization.
The Javascript library deck.gl is going to handle the heavy lifting here. It wraps Mapbox and allows you to define WebGL-friendly visualization layers on top of the Mapbox tiles.
5This is no coincidence: Uber's open-source team maintains both deck.gl and h3.
By default, deck.gl ships with support for a large number of visualizations, including hexgrids. Even more conveniently, deck.gl comes with an H3-specific visualization layer, called deck.H3HexagonLayer. We'll use this layer to draw our visuals.5
6If you're not using Python 3, run python -m SimpleHTTPServer 8080 instead.
Assuming that hexed.json
, map.html
and index.js
are all in the same directory, you can run python -m http.server 8080
6 and navigate to localhost:8080/map.html to see the map, or check out the source code on Github.
You can see this visualization running live here. You can pan around, zoom in and out, and rotate the camera (ctrl-click-and-drag).
The THOR dataset also covers WW1, WW2, and the Korean War, though the Vietnam War data is by far the most exhaustive. These datasets have nearly limitless potential; there's a lot of incredibly interesting data in there, and this visualization only scratches the surface. I hope you find it interesting enough to dig into the data yourself.
Thanks to Uber's team for deck.gl and h3, to Dave Liepmann for tufte.css, and to my employer data.world for introducing me to this data and making it free to download and use.