Speeding Up JSON Parsing in R

How we used ‘jsonlite’ and ‘rjson’ in combination to plug our R service into a Python Flask application server

by Ananya Harsh Jha (ananya.jha@elucidata.io) and Swetabh Pathak (swetabh.pathak@elucidata.io)

 


R is the tool of choice for many data scientists when it comes to statistical computing. So, when we had to develop a statistical computation service for our web application, it was only logical to do it in R. We decided that the computation service had to run independently of the application server. This would allow us to make changes in the R layer easily and to scale by adding other services.

Once the R code had been written, we started to plug it into our application. That’s where we ran into some problems. This post is about how we solved them. We hope that our experience will be useful for others trying to solve a similar problem.

First some details about the setup:

  1. Our computation module has a REST API (Flask-RESTful) that takes in data in the form of JSON
  2. The computation module performs the following tasks after receiving the object:
    1. Parse JSON object into the required format
    2. Run R functions on the data
    3. Convert the computed result object to JSON
    4. Send this JSON object as a response back to the client, which in this case would be our application server
  3. The application server is written in Python Flask.
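
The steps of the computation module can be sketched as a single R function. This is only an illustration under simplifying assumptions, not our production code: `handle_request` and `run_analysis` are hypothetical names, the analysis is a stand-in (column means), and the request is assumed to be a flat JSON object of numeric arrays rather than the nested format described later.

```r
# Hypothetical stand-in for the statistical routines (here: column means).
run_analysis <- function(df) {
  as.list(colMeans(df))
}

# Hypothetical request handler covering steps 1 to 3; step 4 (sending the
# response) is left to the web layer that calls this function.
handle_request <- function(json_string) {
  df <- as.data.frame(jsonlite::fromJSON(json_string))  # 1. parse JSON
  result <- run_analysis(df)                            # 2. run R functions
  jsonlite::toJSON(result, auto_unbox = TRUE)           # 3. result to JSON
}

response <- handle_request('{"col1": [1, 2, 3], "col2": [10, 20, 30]}')
print(response)
```

Keeping the handler as a plain function of a JSON string means the web layer only has to pass text in and out.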

What we tried that did not work

Our first instinct was to use an R server for the computation service. We came across ‘deployR’ by Revolution Analytics. It looked like a great option at the start, but the lack of an open source community and its unclear configuration error messages prompted us not to go forward with it.

When we were unable to find another robust solution in R, we decided to use a Flask server on top of the R module. It would call R functions using ‘rpy2’. The application server would post requests to the RESTful API of the computational layer. We started out by parsing JSON objects in Python to create an ‘rpy2’ object. This ‘rpy2’ object would then be used to call the R functions. This turned out to be painfully slow, making it unsuitable for our web application. The conversion from JSON to an ‘rpy2’ object was taking a lot longer than the actual computation in R.

To solve this, we started parsing JSON objects directly in R. The Flask API just forwards the JSON objects received from the client to R. Converting JSON to an R data frame turned out to be much faster than converting JSON to an ‘rpy2’ data frame.

Combining JSON parsers: Saving milliseconds of response time

Even after this, the service was not fast enough. Parsing JSON in R was now the rate-determining step. The two popular packages for handling JSON objects in R are ‘rjson’ and ‘jsonlite’. Most of the time, one would pick one of them and move on. We decided to experiment with the parsing functions of both libraries, and discovered that the running times of their ‘toJSON()’ and ‘fromJSON()’ functions vary with the type of object being handled. We used this to our benefit by choosing, for each step, whichever function was faster. This in turn helped us bring down our server response times. What follows is a summary of the comparisons that we did.
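
A comparison of this kind can be set up with base R alone. The sketch below is illustrative and is not the harness that produced the numbers in this post: `time_ms` is a hypothetical helper, the payload is synthetic, and the timings it prints will differ from machine to machine.

```r
# Hypothetical helper: average elapsed time per call, in milliseconds.
time_ms <- function(f, n = 20) {
  elapsed <- system.time(for (i in seq_len(n)) f())[["elapsed"]]
  1000 * elapsed / n
}

# Synthetic payload: a named list of 12 numeric columns of 5000 values each.
payload <- lapply(1:12, function(i) as.numeric(seq_len(5000)) * i)
names(payload) <- paste0("col", 1:12)
json_string <- rjson::toJSON(payload)

# Calls are namespaced because both packages export fromJSON() and toJSON().
timings <- c(
  rjson_fromJSON    = time_ms(function() rjson::fromJSON(json_string)),
  jsonlite_fromJSON = time_ms(function() jsonlite::fromJSON(json_string)),
  rjson_toJSON      = time_ms(function() rjson::toJSON(payload)),
  jsonlite_toJSON   = time_ms(function() jsonlite::toJSON(payload))
)
print(round(timings, 2))
```

Timing each function on objects shaped like the real requests and responses matters, since the ranking of the two packages changes with the object type.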

JSON request object structure:

{
"data": "data frame JSON object",
"data_vector": "data vector JSON object"
}

Format of data frame JSON object:

{
"col1": {"row1": 1, "row2": 2, …, "row-n": 100},
"col2": {"row1": 5, "row2": 10, …, "row-n": 500},
"col3": {"row1": 10, "row2": 20, …, "row-n": 1000},
…,
"col-n": {"row1": 100, "row2": 200, …, "row-n": 10000}
}

Format of data vector JSON object:

{"row1": "label1", "row2": "label1", "row3": "label2", …, "row-n": "label1"}
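
For experimenting, a small request of this shape can be built in R. The sketch assumes, as the benchmark code suggests by passing `object$data` and `object$data_vector` back into `fromJSON()`, that the two inner objects travel as JSON strings nested inside the outer object. The toy data and variable names are illustrative only.

```r
# Toy data: 2 columns x 3 rows with row names (real requests are far larger).
df <- data.frame(col1 = c(1, 2, 3), col2 = c(5, 10, 15),
                 row.names = c("row1", "row2", "row3"))
labels <- c(row1 = "label1", row2 = "label1", row3 = "label2")

# Column-oriented form: each column becomes a {row name: value} object.
columns <- lapply(df, function(column) as.list(setNames(column, rownames(df))))

# Inner objects are serialised first, then nested as strings in the outer one.
request <- rjson::toJSON(list(
  data        = rjson::toJSON(columns),
  data_vector = rjson::toJSON(as.list(labels))
))
cat(request, "\n")
```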

Benchmark Results

Loading a JSON file from the local disk

object <- rjson::fromJSON(file = "request.json")

Average running time: 120ms

object <- jsonlite::fromJSON(txt = "request.json")

Average running time: 15ms

Parsing a JSON object from a client request

object <- rjson::fromJSON(json_object)

Average running time: 5ms

object <- jsonlite::fromJSON(json_object)

Average running time: 9s

Extracting a data frame from a JSON object

data <- rjson::fromJSON(object$data)

Average running time: 26s

data <- jsonlite::fromJSON(object$data)

Average running time: 1.4 seconds
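
With the column-oriented format, the parsers return a nested list (one list per column, keyed by row name), which still has to be reshaped into a data frame. One possible way to finish the conversion is sketched below on toy data; this is an assumption for illustration and not necessarily the exact code that was benchmarked.

```r
data_json <- '{"col1": {"row1": 1, "row2": 2, "row3": 3},
               "col2": {"row1": 5, "row2": 10, "row3": 15}}'

# The parser yields a list of columns, each column a list keyed by row name.
parsed <- jsonlite::fromJSON(data_json)

# Flatten every column to a named vector, bind the columns side by side,
# and convert the resulting matrix (row names preserved) to a data frame.
data <- as.data.frame(do.call(cbind, lapply(parsed, unlist)))
print(data)
```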

Extracting a data vector from a JSON object

data_vector <- as.factor(unlist(rjson::fromJSON(object$data_vector)))

Average running time: < 1ms

data_vector <- as.factor(unlist(jsonlite::fromJSON(object$data_vector)))

Average running time: 1ms

Creating a JSON object from an R data frame

return_object <- rjson::toJSON(data_frame)

return object: {[1, 2, 3, 4, …, 1000]}

Average running time: 62ms

return_object <- jsonlite::toJSON(data_frame)

return object: 
[{"col1": 1, "col2": 2, …, "col-n": 100, "_row": "row1"},
 {"col1": 5, "col2": 10, …, "col-n": 500, "_row": "row2"},
 {"col1": 10, "col2": 20, …, "col-n": 1000, "_row": "row3"},
 …,
 {"col1": 100, "col2": 200, …, "col-n": 10000, "_row": "row-n"}]

Average running time: 27ms

In this case we go with ‘jsonlite::toJSON()’. It takes less time than ‘rjson::toJSON()’ and also retains the data frame’s metadata, such as row names, column names and dimensions. This information is useful in recreating the data frame on the client side.
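
The row-name handling can be seen on a toy data frame. The sketch below uses made-up values and shows that a client using ‘jsonlite’ gets the row names back when it parses the string.

```r
df <- data.frame(col1 = c(1, 5), col2 = c(2, 10),
                 row.names = c("row1", "row2"))

# jsonlite writes one JSON object per row and stores the row name under "_row".
json <- jsonlite::toJSON(df)
print(json)

# Parsing the same string with jsonlite turns "_row" back into row names.
restored <- jsonlite::fromJSON(json)
print(rownames(restored))
```

A client that is not written in R sees ‘_row’ as an ordinary key and has to map it to an index itself.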

Creating a JSON object from an R vector

return_object <- rjson::toJSON(data_vector)

return object: {"a": 5, "b": 10, …, "z": 130}

Average running time: 7ms

return_object <- jsonlite::toJSON(data_vector)

return object: [5, 10, 15, …, 100]

Average running time: 2ms

We prefer the ‘rjson::toJSON()’ method because it retains the key-value pairing inside the JSON object. ‘jsonlite::toJSON()’ does not convert a named R vector to a key-value paired JSON object. We accept a 5ms overhead in this case because we saved about that much when parsing the JSON object in step 2.
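
The difference is easy to reproduce on a short named vector. The sketch also shows a possible ‘jsonlite’-only route to key-value output (converting the vector to a list and unboxing); that variant was not part of our benchmark and is included only as an illustration.

```r
data_vector <- c(a = 5, b = 10, c = 15)

rjson_out    <- rjson::toJSON(data_vector)     # names kept as JSON keys
jsonlite_out <- jsonlite::toJSON(data_vector)  # names dropped, plain array

# Possible jsonlite-only alternative (not benchmarked here): convert the
# vector to a list and unbox the length-1 elements to get key-value pairs.
jsonlite_alt <- jsonlite::toJSON(as.list(data_vector), auto_unbox = TRUE)

cat(rjson_out, as.character(jsonlite_out), as.character(jsonlite_alt),
    sep = "\n")
```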

Combining multiple JSON objects to return to the client

To send back a combination of data frames and data vectors to a client, we convert them to JSON objects individually and append them to a list in R. Then we convert the entire thing to JSON again.

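# Name the list elements with `=`; using `<-` inside list() would create
# unnamed elements and lose the keys in the resulting JSON object.
return_list <- list(
  data_vector = rjson::toJSON(data_vector),
  data_frame = jsonlite::toJSON(data_frame)
)

return(jsonlite::toJSON(return_list))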

‘jsonlite::toJSON()’ is used here because it is faster than ‘rjson::toJSON()’ and retains the key-value pairing when converting an R list to a JSON object.
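
A round trip on toy data shows how such a combined response can be decoded again. The sketch makes two assumptions that go beyond the snippet in this section: the inner JSON is passed along as plain strings (via `as.character()`), and `auto_unbox = TRUE` is used so that each part is written as a single JSON string. The client side is simulated in R here for convenience.

```r
data_vector <- c(a = 5, b = 10)
data_frame  <- data.frame(col1 = c(1, 5), col2 = c(2, 10),
                          row.names = c("row1", "row2"))

# Serialise each part with the encoder chosen for it. as.character() drops
# jsonlite's "json" class so both parts are handled as plain strings.
return_list <- list(
  data_vector = rjson::toJSON(data_vector),
  data_frame  = as.character(jsonlite::toJSON(data_frame))
)

# auto_unbox = TRUE (an assumption of this sketch) writes each part as a
# single JSON string rather than a one-element array.
response <- jsonlite::toJSON(return_list, auto_unbox = TRUE)

# A client decodes in two stages: the outer object first, then each part.
outer <- rjson::fromJSON(as.character(response))
decoded_vector <- unlist(rjson::fromJSON(outer$data_vector))
decoded_frame  <- jsonlite::fromJSON(outer$data_frame)
print(decoded_vector)
print(decoded_frame)
```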

Average server response time: ~ 0.5 seconds

Benchmark Parameters

Request data frame: 12 x 5000, ~ 1MB

Request data vector: 12 elements, 256 bytes

Response data frame: 5000 x 12, ~ 1 MB

Response data vector: 5000 elements, 0.1 MB

Setup: Flask server running on a local machine, calling R scripts via rpy2. The client in the above examples refers to the application server, which will query the computation module as a client.

Processor: 2.7 GHz Core i7 Quad Core

RAM: 16 GB DDR3 1600 MHz

Hard Disk: 512GB SATA3 SSD

We would be happy to learn of even better solutions. Comments and emails are most welcome.

Happy analysing!

Written by: Vipul Jain
