
PHP - Working With Large JSON Files

I'm currently working on a project where I need to ingest and output large amounts of data between systems, without direct access to the databases involved. This has come up on past projects with CSV files, but in this case I am using JSON for various reasons, which changes things quite a bit.

CSV files are somewhat easier to work with when dealing with large amounts of data, because each record sits on its own line. Thus it's easy to create a basic file parser that does the job by reading one line at a time. With JSON, however, the file could be formatted in any number of ways: a single object might span multiple lines, or all the objects might be packed onto one single massive line.
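
For example, these two snippets contain exactly the same data, but a line-based parser would handle them very differently:

[
    {
        "name": "example",
        "value": 1
    }
]

[{"name":"example","value":1}]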

I could have tried to write my own tool to handle this, but luckily somebody else has already solved the problem. In this post I am going to demonstrate using the JSON Machine PHP package to process an extremely large JSON file.

Setup

First we need to create an artificially large JSON file to simulate the problem. One could use something like an online JSON generator, but my browser would crash when I set a really high number of objects to create. Hence I used the following basic script to simulate my use case of a massive array of items with a depth of 1 (i.e. just name/value pairs).

<?php

// Number of records to generate.
$numItems = 1000000;
$items = [];

// Fill every field with a random hash. The values themselves are
// meaningless; we only care about producing a suitably large file.
for ($i=0; $i<$numItems; $i++)
{
    $items[] = [
        "uuid" => md5(rand()),
        "isActive" => md5(rand()),
        "balance" => md5(rand()),
        "picture" => md5(rand()),
        "age" => md5(rand()),
        "eyeColor" => md5(rand()),
        "name" => md5(rand()),
        "gender" => md5(rand()),
        "company" => md5(rand()),
        "email" => md5(rand()),
        "phone" => md5(rand()),
        "address" => md5(rand()),
        "about" => md5(rand()),
        "registered" => md5(rand()),
        "latitude" => md5(rand()),
        "longitude" => md5(rand()),
    ];
}

print json_encode($items, JSON_PRETTY_PRINT);
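
Assuming the script above is saved as generate-data.php (a name I've picked purely for illustration), you can run it from the command line and redirect the output to a file. Building a million items in memory before encoding them can exceed PHP's default memory_limit, so you may need to lift it for this one-off run:

php -d memory_limit=-1 generate-data.php > large-file.json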

I made sure to test this both with and without JSON_PRETTY_PRINT. The flag changes the formatting of the generated file, but the end result of this tutorial is exactly the same either way.

This generated an 843 MB file of one million items, which I feel is large enough for stress testing.

Running

Now that we have a suitably large file, we need to process it.

First we need to install the JSON Machine package:

composer require halaxa/json-machine

Then we can use it in a script like so:

<?php

require_once(__DIR__ . '/vendor/autoload.php');

// Lazily iterate over the top-level array without loading the
// whole file into memory at once.
$products = JsonMachine\JsonMachine::fromFile('large-file.json');

foreach ($products as $product)
{
    $productData = json_encode($product, JSON_PRETTY_PRINT);
    print($productData . PHP_EOL);
}
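
As an aside, this was written against the 0.x releases of JSON Machine. If you end up installing a newer major version (1.0 or later), the entry-point class was renamed, so the equivalent line would look something like:

$products = \JsonMachine\Items::fromFile('large-file.json');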

The script above doesn't do anything particularly useful: it just prints out each object one by one. However, it does demonstrate that we can safely loop over all the items in the JSON file, one at a time, without running out of memory. We can take this further and, for example, batch-insert the items into a database 1,000 at a time, or perform some operation on each one before outputting to another file. A sketch of the batching approach is shown below.
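
This is just a sketch rather than production code: the PDO connection details, the people table, and the uuid and name columns are hypothetical placeholders that you would swap for your own schema.

<?php

require_once(__DIR__ . '/vendor/autoload.php');

// Hypothetical connection details; replace with your own.
$pdo = new PDO('mysql:host=localhost;dbname=mydb', 'user', 'password');

$batchSize = 1000;
$batch = [];

$products = JsonMachine\JsonMachine::fromFile('large-file.json');

foreach ($products as $product)
{
    $batch[] = $product;

    if (count($batch) >= $batchSize)
    {
        insertBatch($pdo, $batch);
        $batch = [];
    }
}

// Insert whatever is left over in the final partial batch.
if (count($batch) > 0)
{
    insertBatch($pdo, $batch);
}

/**
 * Insert a batch of items using a single multi-row INSERT statement.
 */
function insertBatch(PDO $pdo, array $batch)
{
    $placeholders = [];
    $values = [];

    foreach ($batch as $row)
    {
        $placeholders[] = "(?, ?)";
        $values[] = $row['uuid'];
        $values[] = $row['name'];
    }

    $sql = "INSERT INTO people (uuid, name) VALUES " . implode(", ", $placeholders);
    $pdo->prepare($sql)->execute($values);
}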

Last updated: 18th March 2021
First published: 18th March 2021
