From 0 Doc Types to Full Type Declaration with Dynamic Analysis

2019-11-11

This post was updated at August 2020 with fresh know-how.
What is new?

Updated Rector YAML to PHP configuration, as current standard.

I wrote How we Completed Thousands of Missing @var Annotations in a Day. If you have at least some annotations, you can use Rector to do the dirty work.

To be honest, open-source is the top 1 % code there is, but out in the wild of ~~legacy~~ established PHP companies, it's a miracle to see just one string type declaration.

Are these projects lost? Do you have to quit them? And what if the annotations are lying?

I had a great trip with a friend of mine Dave Liddament after PHP Day 2019 in Verona. During Venice sightseeing in beautiful wild rain, we had a short coffee break to talk about Rector. Dave was amazed by what Rector can do for the developer, with such a few lines of YAML config.

As a proper curious developer, he challenged me with series of "Can it do...?" or "How can it...?" questions.

The one we spent almost an hour on one was:

"How can Rector help with type declarations, if there are no docblocks and no type hints?"

<?php

class SomeClass
{
    public function run($value)
    {
        return $value;
    }
}

First I was cold stone and said: "That's far beyond static analysis. That's a job for a human, Rector can't help here".

I vividly recall I was sure was this is impossible (← now this line my motto pick project that is interesting enough, lol).

But David sticks with it questioning:

"If we know there is always a string coming inside and nothing else, we could do this:"

 <?php

 class SomeClass
 {
-    public function run($value)
+    public function run(string $value): string
     {
         return $value;
     }
 }

"Well... yes, you're right."

Detect Every Argument Type

I started to see a very small candle at the end of the tunnel and said:

"We could do some kind of logging types that come to the method. Collect enough data and decide based on that."

There is a similar technique for dead-code analysis - tombs. But the problem is, it's not written in PHP. And if it's not written in the language we use, we are not able to extend it or fix it. That's why PHPStorm plugins written Java take so long to catch up with framework releases.

We wanted to have the code just once in the whole application. If possible in the end and collect all the method calls and their arguments. We tried to use register_shutdown_function and debug_trace for it. But after some time spends hacking them, we gave up.

So it will have to be good old static call under each class method, something like:

 <?php

 class SomeClass
 {
    public function run($value)
    {
+       TypeCollector::collect($value, __METHOD__, 0);
        return $value;
    }
 }

What about Performance?

file_put_contents() takes ~10 ms for 10 000 writes writes, so writing in filesystem might work.

Still, it's safer to use feature toggles or direct small fraction of traffic to a standalone server with these static methods.

How Long Should we Collect Data?

This needs to be tested in the wild. It depends on many factors, for a blogging platform, it can be a week of data. For a payment system, a month would be better, maybe more to be sure.

Also, the same way we collect data first with feature toggle/traffic fraction, we can test added types after they're added to the code.

The Simple Idea

To make the idea more solid, we looked for the edge cases:

"Wait, so we just collect types and then analyze them?"
"Yes, if there are 10 000 calls for the method and it gets a string in 100 % cases, it's a string."
"But, if 5 % of them is null... it will be nullable ?string"
"Exactly, and if it's 5 different types, there is nothing we can do."
"I see, but the point is to complete everything that can be completed, based on data and experience instead of docblock that can contain anything."

The idea is pretty clear, right?

How can we bring it to all PHP developers in Need?

It all seemed like a nice brain exercise for our brains... but we looked for practical appliance that would help every PHP developer in the world.

To automate this process fully, we came with 4 automated steps:

Add type collector to all public class and trait methods

 <?php

 class SomeClass
 {
    public function run($value)
    {
+       TypeCollector::collect($value, __METHOD__, 0);
        return $value;
    }
 }

1. Collect data for a 1-4 weeks

Complete collected types that can be added

 <?php

 class SomeClass
 {
-    public function run($value)
+    public function run(string $value)
     {
         TypeCollector::collect($value, __METHOD__, 0);
         return $value;
     }
 }

Remove type collector

 <?php

 class SomeClass
 {
     public function run(string $value)
     {
-        TypeCollector::collect($value, __METHOD__, 0);
         return $value;
     }
 }

This was May 2019 and it was just an idea. Now, 6 months later, I'm proud to say this 4-step process is now possible. I've merged the PR into Rector just a few minutes ago.

↓

Step 1 - Add Type Collector

use Rector\DynamicTypeAnalysis\Rector\ClassMethod\DecorateMethodWithArgumentTypeProbeRector;
use Rector\Config\RectorConfig;

return function (RectorConfig $rectorConfig): void {
    $rectorConfig->rule(DecorateMethodWithArgumentTypeProbeRector::class);
};

vendor/bin/rector process src

Step 2 - Wait for it...

Step 3 - Complete Collected Types

use Rector\DynamicTypeAnalysis\Rector\ClassMethod\AddArgumentTypeWithProbeDataRector;
use Rector\Config\RectorConfig;

return function (RectorConfig $rectorConfig): void {
    $rectorConfig->rule(AddArgumentTypeWithProbeDataRector::class);
};

vendor/bin/rector process src

Step 4 - Remove Type Collector

use Rector\DynamicTypeAnalysis\Rector\StaticCall\RemoveArgumentTypeProbeRector;
use Rector\Config\RectorConfig;

return function (RectorConfig $rectorConfig): void {
    $rectorConfig->rule(RemoveArgumentTypeProbeRector::class);
};

vendor/bin/rector process src

It's not Perfect, But Done

In all means, it's not perfect. There is still missing support for arrays, nested arrays, type co/ntra/variance, return types, union types, etc. But it's ready to be tested and prototype works (at least that's what unit tests say).

Now it's up to you. Make your code-base filled with real data it already uses. No guessing, no hoping, just science fully-automated.

Last but not least, thank you Dave for a great afternoon and sorry it took me so long to publish this.

Happy lazy coding!