Support for the JSON data format after GSoC 2020

Created: 2020-08-19

The Google Summer of Code (GSoC) 2020 ends this August. I had the pleasure of mentoring Abdallah Elshamy, who worked on the implementation of the jsondecode() and jsonencode() functions for Octave. These functions convert JSON data strings to Octave objects and vice versa.
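As a minimal sketch of the conversion described above, the following snippet decodes a small JSON string and encodes the result again. The JSON string is a made-up example and not one of the benchmark files; the snippet assumes an Octave build that provides jsondecode() and jsonencode().

```octave
% Made-up JSON string for illustration only.
json_str = '{"name": "Octave", "version": 7, "tags": ["free", "numeric"]}';

% JSON string -> Octave struct.
obj = jsondecode (json_str);
disp (obj.name);

% Octave struct -> JSON string.
json_out = jsonencode (obj);
disp (json_out);
```

JSON objects become structs and JSON arrays of strings become cell arrays, so the decoded fields can be accessed like any other Octave data.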

Last week we pushed most of Abdallah’s work to the main Octave repository. He is still working on the functions and will hopefully continue after GSoC 2020 is over. Now that the JSON functions are so convenient to use, I gave them a try with larger JSON data. I collected some of the test cases from the excellent nativejson-benchmark, but with a focus on Octave. Abdallah carried out another test in June to check the compatibility with Matlab.

This benchmark considers only the running times for reading and writing JSON data.

The test environment is a laptop with the following Octave version:

octave_version = version ()
octave_hg_id   = version ('-hgid')
octave_version = 7.0.0
octave_hg_id = 173807014259

The following JSON extensions for Octave are under test.

| name | description |
| --- | --- |
| Octave (builtin) | Based on RapidJSON, reading DOM API. |
| octave-rapidjson | Based on RapidJSON, reading SAX API. |
| octave-jsonstuff | Based on RapidJSON, reading DOM API, writing m-file. |
| JSONio | Based on JSMN, writing m-file. |
| jsonlab | m-file only |

The JSON test files are described in the following table.

| name | size (bytes) | description |
| --- | --- | --- |
| citm_catalog.json | 1,727,204 | Structured data with mixed text and numeric. |
| canada.json | 2,251,060 | Numeric data set in GeoJSON format. |
| large-file.json | 26,141,343 | Structured data with mixed text and numeric. |

Benchmark setup

Create a directory to keep track of the mess.

mkdir ('benchmark');
cd ('benchmark');

Download the benchmark JSON files if they are not already present.

if (exist ('citm_catalog.json', 'file') ~= 2)
  urlwrite ( ...
    'https://github.com/RichardHightower/json-parsers-benchmark/raw/master/data/citm_catalog.json', ...
    'citm_catalog.json');
end

if (exist ('canada.json', 'file') ~= 2)
  urlwrite ( ...
    'https://github.com/mloskot/json_benchmark/raw/master/data/canada.json', ...
    'canada.json');
end

if (exist ('large-file.json', 'file') ~= 2)
  urlwrite ( ...
    'https://github.com/json-iterator/test-data/raw/master/large-file.json', ...
    'large-file.json');
end

Set up octave-rapidjson.

if (exist ('octave-rapidjson', 'dir') == 0)
  urlwrite ( ...
    'https://github.com/Andy1978/octave-rapidjson/archive/2d88511712032b14dea4c2272d82249e7547772a.zip', ...
    'octave-rapidjson.zip');
  unzip  ('octave-rapidjson.zip');
  rename ('octave-rapidjson-2d88511712032b14dea4c2272d82249e7547772a', ...
          'octave-rapidjson');
  cd ('octave-rapidjson')
  urlwrite ( ...
    'https://github.com/Tencent/rapidjson/archive/35e480fc4ddf4ec4f7ad34d96353eef0aabf002d.zip', ...
    'rapidjson.zip');
  unzip  ('rapidjson.zip');
  rename ('rapidjson-35e480fc4ddf4ec4f7ad34d96353eef0aabf002d', 'rapidjson');
  mkoctfile -Wall -Wextra -I./rapidjson/include load_json.cc
  mkoctfile -Wall -Wextra -I./rapidjson/include save_json.cc
  cd ('..')
end

Set up octave-jsonstuff.

if (isempty (pkg ('list', 'jsonstuff')))
  pkg install https://github.com/apjanke/octave-jsonstuff/releases/download/v0.3.3/jsonstuff-0.3.3.tar.gz
end

Set up JSONio.

if (exist ('JSONio', 'dir') == 0)
  urlwrite ( ...
    'https://github.com/gllmflndn/JSONio/archive/6c699a315ac2c578864d8b740a061bff47b718bf.zip', ...
    'JSONio.zip');
  unzip  ('JSONio.zip');
  rename ('JSONio-6c699a315ac2c578864d8b740a061bff47b718bf', 'JSONio');
  cd ('JSONio')
  mkoctfile --mex jsonread.c jsmn.c -DJSMN_PARENT_LINKS
  cd ('..')
end

Set up jsonlab.

if (exist ('jsonlab', 'dir') == 0)
  urlwrite ( ...
    'https://github.com/fangq/jsonlab/archive/d0fb684bd43165d312063345bdb795b628b2c679.zip', ...
    'jsonlab.zip');
  unzip  ('jsonlab.zip');
  rename ('jsonlab-d0fb684bd43165d312063345bdb795b628b2c679', 'jsonlab');
end

Benchmark run

The benchmark function reads each JSON file into a string and calls the library's reading and writing functions. The file is read before the timer starts, so disk access is not part of the measured time.

function t = benchmark (json_read_fcn, json_write_fcn)
  test_files = {'citm_catalog.json', 'canada.json', 'large-file.json'};
  N = length (test_files);
  t = nan (N, 2);
  for i = 1:N
    json_str = fileread (test_files{i});
    tic ();
    octave_obj = json_read_fcn (json_str);
    t(i,1) = toc ();
    tic ();
    json_str2 = json_write_fcn (octave_obj);
    t(i,2) = toc ();
  end
end
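The benchmark function above takes a single tic/toc measurement per file. If the measurements turn out to be noisy, one option is to repeat each call and take the median. The following is only a sketch of that idea and was not part of the original benchmark: the synthetic test data and the variable n_runs are assumptions made for illustration.

```octave
% Synthetic test data (hypothetical, not one of the benchmark files):
% a 1x1000 struct array encoded as a JSON array of objects.
data = struct ('id', num2cell (1:1000), 'label', 'item');
json_str = jsonencode (data);

n_runs = 5;                 % number of repetitions (assumption)
t_read = nan (n_runs, 1);
for k = 1:n_runs
  tic ();
  obj = jsondecode (json_str);
  t_read(k) = toc ();
end

% The median is less sensitive to single outliers than the mean.
t_median = median (t_read);
printf ('median read time: %.4f s over %d runs\n', t_median, n_runs);
```

For the long-running libraries in this benchmark, repeating every call would multiply an already long running time, which is a reasonable argument for the single measurement used here.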

The results for Matlab (R2020b, prerelease) were measured on the same system without JupyterLab.

t.matlab = [
  0.0768, 0.0853;
  0.1510, 0.5405;
  1.2222, 0.6521];

Octave (7.0.0, development version)

t.octave = benchmark (@jsondecode, @jsonencode);

octave-rapidjson

addpath ('octave-rapidjson')
t.rapid_json = benchmark (@load_json, @save_json);
rmpath ('octave-rapidjson')

octave-jsonstuff: No results due to an error.

%pkg load jsonstuff
%t.jsonstuff = benchmark (@jsondecode, @jsonencode);
%error: cat: field names mismatch in concatenating structs
%error: called from
%    jsondecode>condense_decoded_json_recursive at line 116 column 9
%    jsondecode>condense_decoded_json at line 67 column 7
%    jsondecode at line 63 column 7
%    benchmark at line 8 column 16
%pkg unload jsonstuff

JSONio: Because of the long running time, the results of the first run are saved here.

addpath ('JSONio')
%t.jsonio = benchmark (@jsonread, @jsonwrite);
t.jsonio = [ ...
  0.9583,  30.5410;
  6.1333,  17.4022;
  4.3382, 552.8929];
rmpath ('JSONio')

jsonlab: Because of the long running time, the results of the first run are saved here.

addpath ('jsonlab')
%t.jsonlab = benchmark (@loadjson, @savejson);
t.jsonlab = [ ...
   35.6242,  26.0625;
    6.1303,   0.7365;
  372.2456, 601.5318];
rmpath ('jsonlab')

Benchmark results

Update 2020-08-29: Abdallah found out that the speed problem (described below) was caused by the call to matlab.lang.makeValidName, not by the chosen DOM API.

graphics_toolkit ('qt')
titles = {'citm\_catalog.json (2 MB, mixed)', ...
          'canada.json (2 MB, numeric)', ...
          'large-file.json (26 MB, mixed)'};
for i = 1:3
  subplot (3, 1, i);
  bar ([t.matlab(i,:); t.octave(i,:); t.rapid_json(i,:)]');
  legend ({'Matlab (R2020b, pre)', 'Octave (7.0.0, dev)', ...
           'octave-rapidjson'}, 'Location', 'bestoutside');
  xticklabels({'read','write'});
  ylabel ('time in seconds');
  title (titles{i});
end
[Figure: read and write times in seconds for Matlab (R2020b, pre), Octave (7.0.0, dev), and octave-rapidjson, one panel per test file.]
for i = 1:3
  subplot (3, 1, i);
  bar ([t.jsonio(i,:); t.jsonlab(i,:)]');
  legend ({'JSONio', 'jsonlab'}, 'Location', 'bestoutside');
  xticklabels({'read','write'});
  ylabel ('time in seconds');
  title (titles{i});
end
[Figure: read and write times in seconds for JSONio and jsonlab, one panel per test file.]

The results are not as overwhelming as I initially hoped for (they are, see the update at the top of this section).

The first figure compares the running times of Matlab, Octave, and octave-rapidjson. Both Octave and octave-rapidjson are based on RapidJSON.

It must be the choice of the API (DOM vs. SAX) that slows down the current Octave implementation (DOM) by a factor of 10 to 100 (wrong, see the update at the top of this section).

For mixed data, octave-rapidjson, which uses the SAX API, is not slower than Matlab. However, its implementation is less Matlab compatible than the current Octave implementation. The DOM API was chosen for the current Octave implementation to achieve the best compatibility with Matlab.

On the positive side, for mostly numeric data (canada.json) the DOM API outperforms the SAX API. Nevertheless, my humble assumption is that mixed data is more common in JSON data files.

The results of JSONio and jsonlab are shown in a separate second figure, as their running times are significantly larger than those in the first figure. For octave-jsonstuff, we could not obtain any results due to an error. I’ll inform the maintainer and hope to repeat this benchmark in the future.
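To put "significantly larger" into numbers, the recorded running times can be divided element-wise by the Matlab times. The sketch below uses only the values listed earlier in this post; the variable names are chosen for illustration.

```octave
% Recorded running times in seconds (rows: citm_catalog.json,
% canada.json, large-file.json; columns: read, write).
t_matlab  = [0.0768, 0.0853; 0.1510, 0.5405; 1.2222, 0.6521];
t_jsonio  = [0.9583, 30.5410; 6.1333, 17.4022; 4.3382, 552.8929];
t_jsonlab = [35.6242, 26.0625; 6.1303, 0.7365; 372.2456, 601.5318];

% Element-wise slowdown factors relative to Matlab.
slowdown_jsonio  = t_jsonio  ./ t_matlab;
slowdown_jsonlab = t_jsonlab ./ t_matlab;

disp (round (slowdown_jsonio));
disp (round (slowdown_jsonlab));
```

Every factor is greater than one, and writing large-file.json stands out for both libraries.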

Based on this benchmark, Octave should seriously consider switching to the SAX API while preserving the current Matlab compatibility (see the update at the top of this section).

GSoC 2020 is over, and Abdallah has enriched Octave with a great new feature. Once he (or someone else) ports the Octave function matlab.lang.makeValidName to C++, the performance of JSON decoding and encoding will be great while staying compatible with Matlab.
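For readers unfamiliar with matlab.lang.makeValidName: JSON keys can be arbitrary strings, while struct field names must be valid identifiers, so keys have to be converted during decoding. The following sketch only illustrates that conversion. The keys are made-up examples, and the snippet says nothing about how jsondecode() calls the function internally.

```octave
% Made-up JSON keys, two of which are not valid Octave identifiers.
keys = {'name', '1st value', 'my-key'};
for k = 1:numel (keys)
  valid = matlab.lang.makeValidName (keys{k});
  printf ('%-12s -> %s (isvarname: %d)\n', keys{k}, valid, isvarname (valid));
end

% Decoding a JSON object with such keys yields valid field names.
obj = jsondecode ('{"1st value": 1, "my-key": 2}');
disp (fieldnames (obj));
```

Since such a conversion may be needed for every key of every object, files with many small objects plausibly suffer most from a slow implementation.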