Support for the JSON data format after GSoC 2020
Created: 2020-08-19
The Google Summer of Code (GSoC) 2020 ends this August. I had the pleasure of mentoring Abdallah Elshamy, who worked on the implementation of the jsondecode() and jsonencode() functions for Octave. These functions convert JSON data strings to Octave objects and vice versa.
Last week we pushed most of Abdallah’s work to the main Octave repository. He is still working on the functions and will hopefully continue after GSoC 2020 is over. Since the JSON functions are now very convenient to use, I gave them a try with larger JSON data. I collected some of the test cases from the excellent nativejson-benchmark, but this benchmark focuses on Octave. Abdallah carried out another test in June to check the compatibility with Matlab.
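To give an impression of what the two functions do, here is a minimal round-trip sketch. The JSON string is a made-up example and not one of the benchmark files.

```octave
% Decode a small, made-up JSON string into an Octave struct.
json_str = '{"name": "Octave", "version": 7, "tags": ["gsoc", "json"]}';
data = jsondecode (json_str);

disp (data.name)            % JSON object members become struct fields
disp (class (data.tags))    % a JSON array of strings becomes a cell array

% Encode the struct back into a JSON string.
json_out = jsonencode (data);
disp (json_out)
```

Decoding the encoded string again should give back the same struct, which is a handy sanity check before timing anything.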
This benchmark only considers the running times for reading and writing JSON data.
The test environment is a laptop with
- 16 GB of main memory
- OpenSUSE 15.2 Linux
- Octave version as given below
octave_version = version ()
octave_hg_id = version ('-hgid')
octave_version = 7.0.0
octave_hg_id = 173807014259
The following JSON extensions for Octave are under test.
name | description |
---|---|
Octave (7.0.0, development version) | Based on RapidJSON, reading DOM API. |
octave-rapidjson | Based on RapidJSON, reading SAX API. |
octave-jsonstuff | Based on RapidJSON, reading DOM API, writing m-file. |
JSONio | Based on JSMN, writing m-file. |
jsonlab | m-file only |
The JSON test files are described in the following table.
name | size (byte) | description |
---|---|---|
citm_catalog.json | 1,727,204 | Structured data with mixed text and numeric. |
canada.json | 2,251,060 | Numeric data set in GeoJSON format. |
large-file.json | 26,141,343 | Structured data with mixed text and numeric. |
Benchmark setup
Create a directory to keep track of the mess.
mkdir ('benchmark');
cd ('benchmark');
Download the benchmark JSON files, unless they are already present.
if (exist ('citm_catalog.json', 'file') ~= 2)
urlwrite ( ...
'https://github.com/RichardHightower/json-parsers-benchmark/raw/master/data/citm_catalog.json', ...
'citm_catalog.json');
end
if (exist ('canada.json', 'file') ~= 2)
urlwrite ( ...
'https://github.com/mloskot/json_benchmark/raw/master/data/canada.json', ...
'canada.json');
end
if (exist ('large-file.json', 'file') ~= 2)
urlwrite ( ...
'https://github.com/json-iterator/test-data/raw/master/large-file.json', ...
'large-file.json');
end
Set up octave-rapidjson.
if (exist ('octave-rapidjson', 'dir') == 0)
urlwrite ( ...
'https://github.com/Andy1978/octave-rapidjson/archive/2d88511712032b14dea4c2272d82249e7547772a.zip', ...
'octave-rapidjson.zip');
unzip ('octave-rapidjson.zip');
rename ('octave-rapidjson-2d88511712032b14dea4c2272d82249e7547772a', ...
'octave-rapidjson');
cd ('octave-rapidjson')
urlwrite ( ...
'https://github.com/Tencent/rapidjson/archive/35e480fc4ddf4ec4f7ad34d96353eef0aabf002d.zip', ...
'rapidjson.zip');
unzip ('rapidjson.zip');
rename ('rapidjson-35e480fc4ddf4ec4f7ad34d96353eef0aabf002d', 'rapidjson');
mkoctfile -Wall -Wextra -I./rapidjson/include load_json.cc
mkoctfile -Wall -Wextra -I./rapidjson/include save_json.cc
cd ('..')
end
Set up octave-jsonstuff.
if (isempty (pkg ('list', 'jsonstuff')))
pkg install https://github.com/apjanke/octave-jsonstuff/releases/download/v0.3.3/jsonstuff-0.3.3.tar.gz
end
Set up JSONio.
if (exist ('JSONio', 'dir') == 0)
urlwrite ( ...
'https://github.com/gllmflndn/JSONio/archive/6c699a315ac2c578864d8b740a061bff47b718bf.zip', ...
'JSONio.zip');
unzip ('JSONio.zip');
rename ('JSONio-6c699a315ac2c578864d8b740a061bff47b718bf', 'JSONio');
cd ('JSONio')
mkoctfile --mex jsonread.c jsmn.c -DJSMN_PARENT_LINKS
cd ('..')
end
Set up jsonlab.
if (exist ('jsonlab', 'dir') == 0)
urlwrite ( ...
'https://github.com/fangq/jsonlab/archive/d0fb684bd43165d312063345bdb795b628b2c679.zip', ...
'jsonlab.zip');
unzip ('jsonlab.zip');
rename ('jsonlab-d0fb684bd43165d312063345bdb795b628b2c679', 'jsonlab');
end
Benchmark run
The benchmark function reads each JSON file into a string and calls the library's reading and writing functions. Column 1 of the result holds the reading times and column 2 the writing times, with one row per test file.
function t = benchmark (json_read_fcn, json_write_fcn)
test_files = {'citm_catalog.json', 'canada.json', 'large-file.json'};
N = length (test_files);
t = nan (N, 2);
for i = 1:N
json_str = fileread (test_files{i});
tic ();
octave_obj = json_read_fcn (json_str);
t(i,1) = toc ();
tic ();
json_str2 = json_write_fcn (octave_obj);
t(i,2) = toc ();
end
end
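The benchmark function above takes a single tic/toc measurement per file, which can be noisy. As a sketch of a more robust variant (not what was used for the numbers in this post), a hypothetical helper time_best_of could repeat the call and keep the fastest run. The helper name, the tiny JSON string, and the repeat count are assumptions for illustration only.

```octave
1;  % mark this file as a script, so the function below can be defined inline

% Hypothetical helper (not part of the original benchmark):
% call fcn (arg) "repeats" times and return the fastest wall-clock time.
function t_min = time_best_of (fcn, arg, repeats)
  t_min = Inf;
  for k = 1:repeats
    t_start = tic ();
    fcn (arg);
    t_min = min (t_min, toc (t_start));
  end
end

% Usage with a tiny, made-up JSON string instead of the benchmark files.
json_str = '{"a": [1, 2, 3], "b": "text"}';
t_read = time_best_of (@jsondecode, json_str, 5);
fprintf ('fastest jsondecode run: %g s\n', t_read);
```

Taking the minimum rather than the mean reduces the influence of one-off disturbances, such as other processes running on the laptop.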
The results for Matlab (R2020b, prerelease) were measured on the same system, but without JupyterLab.
t.matlab = [
0.0768, 0.0853;
0.1510, 0.5405;
1.2222, 0.6521];
Octave (7.0.0, development version)
t.octave = benchmark (@jsondecode, @jsonencode);
octave-rapidjson
addpath ('octave-rapidjson')
t.rapid_json = benchmark (@load_json, @save_json);
rmpath ('octave-rapidjson')
octave-jsonstuff: No results due to an error.
%pkg load jsonstuff
%t.jsonstuff = benchmark (@jsondecode, @jsonencode);
%error: cat: field names mismatch in concatenating structs
%error: called from
% jsondecode>condense_decoded_json_recursive at line 116 column 9
% jsondecode>condense_decoded_json at line 67 column 7
% jsondecode at line 63 column 7
% benchmark at line 8 column 16
%pkg unload jsonstuff
JSONio: Because of the long running time, the results of the first run are saved here.
addpath ('JSONio')
%t.jsonio = benchmark (@jsonread, @jsonwrite);
t.jsonio = [ ...
0.9583, 30.5410;
6.1333, 17.4022;
4.3382, 552.8929];
rmpath ('JSONio')
jsonlab: Because of the long running time, the results of the first run are saved here.
addpath ('jsonlab')
%t.jsonlab = benchmark (@loadjson, @savejson);
t.jsonlab = [ ...
35.6242, 26.0625;
6.1303, 0.7365;
372.2456, 601.5318];
rmpath ('jsonlab')
Benchmark results
Update 2020-08-29: Abdallah found out that the speed problem (described below) was caused by the call to matlab.lang.makeValidName, not by the chosen DOM API.
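For context: JSON object keys can be arbitrary strings, whereas Octave struct field names must be valid identifiers. The following sketch, using made-up keys, shows the kind of conversion matlab.lang.makeValidName performs. That such a conversion has to happen for a large number of keys when decoding a big mixed-data file is presumably what makes the call expensive.

```octave
% Made-up JSON keys; only the first one is already a valid identifier.
keys = {'name', '1st key', 'with-dash'};

% Convert all keys to valid Octave identifiers in one call.
valid = matlab.lang.makeValidName (keys);

for k = 1:numel (keys)
  fprintf ('%-10s -> %s\n', keys{k}, valid{k});
end
```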
graphics_toolkit ('qt')
titles = {'citm\_catalog.json (2 MB, mixed)', ...
'canada.json (2 MB, numeric)', ...
'large-file.json (26 MB, mixed)'};
for i = 1:3
subplot (3, 1, i);
bar ([t.matlab(i,:); t.octave(i,:); t.rapid_json(i,:)]');
legend ({'Matlab (R2020b, pre)', 'Octave (7.0.0, dev)', ...
'octave-rapidjson'}, 'Location', 'bestoutside');
xticklabels({'read','write'});
ylabel ('time in seconds');
title (titles{i});
end
for i = 1:3
subplot (3, 1, i);
bar ([t.jsonio(i,:); t.jsonlab(i,:)]');
legend ({'JSONio', 'jsonlab'}, 'Location', 'bestoutside');
xticklabels({'read','write'});
ylabel ('time in seconds');
title (titles{i});
end
The results are not as overwhelming as I initially hoped (they are, see the update above).
The first figure compares the running times of Matlab, Octave, and octave-rapidjson. Both Octave and octave-rapidjson are based on RapidJSON. It must be the choice of the API (DOM vs. SAX) that slows down the current Octave implementation (DOM) by a factor of 10 to 100 (wrong, see the update above).
For the mixed data cases, octave-rapidjson, which uses the SAX API, is not slower than Matlab. However, it is less Matlab compatible than the current Octave implementation. DOM was chosen for the current Octave implementation to achieve the best compatibility with Matlab.
On the positive side, in the case of more numeric data (canada.json) the DOM API outperforms the SAX API. Nevertheless, my humble assumption is that mixed data is more common for JSON data files.
The results of JSONio and jsonlab are split into a second figure, as their running times are significantly larger than those in the first figure. For octave-jsonstuff we could not obtain any results due to an error. I’ll inform the maintainer and hope to repeat this benchmark in the future.
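To put "significantly larger" into numbers, the saved timings from above can be divided by the Matlab timings. This is only a small sketch on the values listed in this post; the variable names are chosen for the example.

```octave
% Timings in seconds as listed above: rows are citm_catalog.json,
% canada.json, large-file.json; columns are read and write.
t_matlab  = [0.0768, 0.0853;   0.1510, 0.5405;   1.2222, 0.6521];
t_jsonio  = [0.9583, 30.5410;  6.1333, 17.4022;  4.3382, 552.8929];
t_jsonlab = [35.6242, 26.0625; 6.1303, 0.7365;   372.2456, 601.5318];

% Element-wise slowdown factors relative to Matlab.
ratio_jsonio  = t_jsonio  ./ t_matlab;
ratio_jsonlab = t_jsonlab ./ t_matlab;

disp (round (ratio_jsonio))
disp (round (ratio_jsonlab))
```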
GSoC 2020 is over, and Abdallah enriched Octave with a great new feature. Once he (or someone else) ports the Octave function matlab.lang.makeValidName to C++, the performance of JSON decoding and encoding will be great, and the functions will remain compatible with Matlab.