To be honest, I had no idea what I was doing. I've read a couple of posts about parallel processes in PHP, but most confused me even more than before: too much vague theory, links to dozens of open-source packages, 5 alternatives to one operation, and other educational faults.
What I missed was a to-do list for a 6-year-old PHP programmer. Straightforward, with everyday terminology developers already know.
Do you want to have a better idea of how to add a parallel run to one of your PHP CLI apps?
This post will get you from 0 to padawan in a couple of minutes.
Disclaimer: if you've been doing parallel programming for a couple of years, this post is not for you. It will only confuse you with incorrect interpretations that you'll have to correct in tweets and comments. This post is not for experts, but for those who want to try parallelization today for the first time.
"If you can't explain it to a 6-year-old,
you don't understand it yourself."
Last month I tweeted about a 16x faster ECS, the most significant performance improvement I've seen since the upgrade to PHP 7.
I got one question about the architecture:
Blog post coming on how you achieved it? It would be good to have blog post on how to do parallel run efficiently in PHP.
— Ishan Vyas (@Ishanvyas22) October 7, 2021
Today I'll share my limited experience with parallel CLI PHP apps. It's an experience I got by exploring the PHPStan code and through hundreds of trials and errors. What is a CLI PHP app? A PHP tool that you run in the command line - ECS, php-cs-fixer, PHP_CodeSniffer, PHPStan, Rector, PHPUnit, Composer etc.
Is everything clear? Let's start.
I first met parallelism in a live stream 4 years ago. My first problem with parallel runs was that the developers who talked about them made the topic sound very complex. I asked one question to understand one concept better, but in the end, I was even more confused than before I asked.
That made me think:
I have good news for you - none of it is true. You just have to be lucky enough to come across sources that make you feel smarter.
The first point is: it's simpler than you think.
We don't implement it because it's cool, because PHP allows it, or because it improves our architecture.
We want to get somewhere significantly faster. We're talking 10-20x faster.
Last year, my laptop got a little shower during some wild traveling and decided to stop working. Czech law gives the seller a month to process the warranty, so I had to get a replacement for the next month.
I bought the first Lenovo ThinkPad that looked similar to the one I used, so I wouldn't have to learn a new keyboard for a single month. I got a surprise: the PHPStan run was cut in half.
Why? A parallel run is roughly x times faster, where x is the number of CPU threads. It's not about CPU cores, but about CPU threads. My temporary laptop had an AMD Ryzen CPU with 8 cores but an excellent 16 threads.
That means a parallel process that uses all CPU threads can be up to 16x faster.
Have you waited 2 minutes to finish a command-line process? Now it's 8 seconds.
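How many threads does your machine have? A minimal sketch to find out - assuming Linux's `nproc` or macOS's `sysctl` is available; the function name is made up for this example:

```php
<?php

// detect how many CPU threads are available;
// "nproc" is Linux-specific, macOS uses "sysctl -n hw.ncpu"
function detectCpuThreadCount(): int
{
    $command = PHP_OS_FAMILY === 'Darwin' ? 'sysctl -n hw.ncpu' : 'nproc';
    $output = shell_exec($command);

    // fall back to a single thread if detection fails
    return max(1, (int) $output);
}

echo detectCpuThreadCount();
```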
Typical ECS command looks like this:
vendor/bin/ecs check src
This command finds all PHP files in the /src directory and runs a foreach to check for coding standard violations. Roughly like this:
$foundFiles = $this->findFiles(__DIR__ . '/src');

foreach ($foundFiles as $foundFile) {
    $this->codingStandardApplication->processFile($foundFile);
}
Before the 2nd file can be processed by the coding standard, we have to wait for the 1st file to finish.
This is the bottleneck.
How to start with parallelization? Look for "the main" foreach (...) in your code.
What do you do when you need a repository service in your project? We inject it via the constructor and use it. It has access to a database, where all data is up-to-date, and we can load, edit and delete it. We trust its stability.
In parallel, this is a bit different. How?
This point started as a few sentences, but soon grew into its own post. It's a metaphor that hits the nail on the head.
Go read Parallel in PHP for Dummies? Cooking a Family Dinner and then come back for the best experience of this list.
So now we know the processes run separately, each at its own pace. But above, we still have a foreach. How do we run the iterations separately, without waiting for each other?
We refactor the service call into another command-line command:
 foreach ($familyMembers as $key => $familyMember) {
     $ingredientsChunk = $ingredientsChunks[$key];
-    $foundIngredients[] = $familyMember->findIngredients($ingredientsChunk);
+    $foundIngredients[] = exec(sprintf(
+        'vendor/bin/find-ingredient --member %s --chunk %s',
+        $familyMember,
+        $ingredientsChunk
+    ));
 }
This way, we create as many subcommands in the background as we have family members. Each of them runs separately.
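One catch: exec() waits for the command to finish, so a plain loop still runs one process at a time. To truly run them side by side, we have to start all the processes first and only then collect their output. A sketch with proc_open() - the helper function name is made up:

```php
<?php

// start all commands first, then collect their output -
// this way the processes actually run at the same time
function runCommandsInParallel(array $commands): array
{
    $processes = [];
    foreach ($commands as $key => $command) {
        $pipes = [];
        // descriptor 1 = stdout, opened as a pipe so we can read it later
        $processes[$key] = [proc_open($command, [1 => ['pipe', 'w']], $pipes), $pipes[1]];
    }

    // all commands are running now - read their output one by one
    $outputs = [];
    foreach ($processes as $key => [$process, $stdout]) {
        $outputs[$key] = trim(stream_get_contents($stdout));
        fclose($stdout);
        proc_close($process);
    }

    return $outputs;
}
```

In the dinner example, $commands would be the vendor/bin/find-ingredient calls from above.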
How does this work in ECS? Before, we had one command to process all the files:
vendor/bin/ecs check /src
Now the main command stays the same, but it runs itself in the background in multiple threads:
# this is what we type
vendor/bin/ecs check /src
# this is what really happens
→ vendor/bin/ecs check-worker --cpu-thread 1 --files /src/first.php /src/second.php
→ vendor/bin/ecs check-worker --cpu-thread 2 --files /src/third.php /src/fourth.php
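How do the files get split between the workers? A sketch of the idea with array_chunk() - the real ECS scheduling is more involved, but the principle is the same:

```php
<?php

$foundFiles = ['/src/first.php', '/src/second.php', '/src/third.php', '/src/fourth.php'];
$cpuThreadCount = 2;

// split the files evenly, one chunk per CPU thread
$fileChunks = array_chunk($foundFiles, (int) ceil(count($foundFiles) / $cpuThreadCount));

foreach ($fileChunks as $key => $fileChunk) {
    // each chunk becomes one worker command
    echo sprintf(
        'vendor/bin/ecs check-worker --cpu-thread %d --files %s',
        $key + 1,
        implode(' ', $fileChunk)
    ) . PHP_EOL;
}
```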
What is the check-worker command doing exactly? It's an exact copy of the check command. The check command used to be a foreach (...) caller of a service, but now it calls standalone processes.
This step was mind-blowing for me. A typical ECS run checked files for coding standard violations and printed the errors - all inside one PHP container:
vendor/bin/ecs check /src
Found 25 errors. Fix them with the "--fix" option.
But how can we work with nested command calls? We only have bash there - no PHP, no services, no container. It's like calling an external API:
curl /app/find-ingredient --member 1 --chunk onion,garlic
Does this remind you of something? What kind of response do we get when we call an API?
curl /app/find-ingredient --member 1 --chunk onion,garlic
{"onion": "found", "garlic": "not_found"}
A JSON!
So when we call the ECS worker command, we expect the JSON:
→ vendor/bin/ecs check-worker --cpu-thread 1 --files /src/first.php /src/second.php
{"/src/first.php": {"error_count": 0}, "/src/second.php": {"error_count": 3}}
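On the worker side, the last step is nothing fancy - json_encode() the collected results and print them to stdout, where the main process picks them up. A sketch with made-up numbers:

```php
<?php

// inside the check-worker command: after processing its files,
// the worker prints the results as JSON to stdout
$results = [
    '/src/first.php' => ['error_count' => 0],
    '/src/second.php' => ['error_count' => 3],
];

echo json_encode($results);
```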
This step makes sense of the whole previous workflow. It means we can only return primitive, JSON-serializable data. We cannot return services, value objects or other PHP objects. Only return what you need to show the user.
To give you an idea, in ECS, the result for a single file looks like this:
[
    {
        "file_path": "/src/first.php",
        "error_messages": [
            "Use spaces over tabs"
        ],
        "file_diffs": [
            "-$value=1;\n+$value = 1;"
        ]
    }
]
This bonus tip is not limited to parallel, but it's a general lifesaver in an unstable environment.
Seeing arrays and strings above might give you shivers. How can we work with such unreliable data and pass them around our application? I feel you. When I don't have an object in my hand, I feel like I'm naked.
Let's put on pants and use value objects the instant we can:
final class FileResult implements JsonSerializable
{
    public function __construct(
        private string $filePath,
        private array $errorMessages,
        private array $fileDiffs,
    ) {
    }

    // we'll use this method in the worker command to send the JSON result
    public function jsonSerialize(): array
    {
        return [
            'file_path' => $this->filePath,
            'error_messages' => $this->errorMessages,
            'file_diffs' => $this->fileDiffs,
        ];
    }
}
When the worker command returns a string response, we'll turn it into value objects:
// string
$checkWorkerResult = exec(
    'vendor/bin/ecs check-worker --cpu-thread 1 --files /src/first.php /src/second.php'
);

// json - decoded to arrays, so we can access the keys below
$checkWorkerJson = Json::decode($checkWorkerResult, Json::FORCE_ARRAY);

// array of FileResult value objects
$fileResults = [];
foreach ($checkWorkerJson as $fileResultJson) {
    $fileResults[] = new FileResult(
        $fileResultJson['file_path'],
        $fileResultJson['error_messages'],
        $fileResultJson['file_diffs']
    );
}
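The main process then merges the results from all workers and reports the grand total, just like the single-process version did ("Found 25 errors..."). A sketch over the decoded arrays - the helper function name is made up:

```php
<?php

// sum up error messages across all workers' decoded JSON results
function countTotalErrors(array $workerResults): int
{
    $totalErrorCount = 0;
    foreach ($workerResults as $fileResults) {
        foreach ($fileResults as $fileResult) {
            $totalErrorCount += count($fileResult['error_messages']);
        }
    }

    return $totalErrorCount;
}
```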
That's it! Give it time, start slowly and make small pull requests.
Happy coding!
Do you learn from my content or use open-source packages like Rector every day?
Consider supporting it on GitHub Sponsors.
I'd really appreciate it!