Putting the AI in Gongkai (公开)
The intellectual property gaps that will enable innovation to flourish in AI
In Chinese, there is a concept called shanzhai (山寨). The term, which translates directly to "mountain fortress," originated in fiction, where it described outlaws who act virtuously and rebel against an unjust state; it was later adopted to refer to counterfeit electronics. It is not limited to straight counterfeits, though: these imitation devices often serve as a platform for innovation, incorporating novel features that OEMs would never produce.
The canonical example of this is the mobile phone market in Shenzhen, China. You can get an Android phone whose case and OS have been skinned to look very similar to an iPhone. Beyond that, though, you can also get phones with cigarette lighters, e-ink screens, physical keyboards, and ultraviolet flashlights. While useful, these are features that a major manufacturer will never include because they serve too small a market.
These permissionless innovations are possible due to a different attitude toward intellectual property than what is common in the United States. Documents, know-how, and collaboration flow more freely in Shenzhen. The knowledge a factory worker gains through their normal duties might carry over to the work they do moonlighting, schematics might be shared in order to close a sale of some microchips, and one might freely help a friend to secure that friend’s help in future endeavors.
The term open source does not accurately capture the spirit of this system. Bunnie Huang coined gongkai (公开) to describe it. It is the open sharing of what would generally be proprietary information as well as the system of collaboration and innovation that it enables.
Intellectual Property in Deep Learning
Intellectual property in deep learning is far less well defined than in established arenas like mobile phone manufacturing. There are three areas we need to consider when looking at copyright: datasets, model weights, and model outputs.
Datasets
Datasets and the content within them can both be covered by copyright. There are certain scenarios, under fair use, in which copyrighted content can be used in the compilation of a dataset.
Fair use and copyrighted inputs deserve a post of their own, so for the purposes of this post we will assume that datasets consist only of original content and annotations.
A company or individual who compiles a dataset can hold copyright in that dataset. Just like software, it can be released under a license that places restrictions on its use.
Model Weights
The online consensus is that weights derived from a dataset are unlikely to be copyrightable. This is primarily because non-humans cannot hold copyright and model weights are the result of an automated process rather than a work of authorship. They are often likened to a set of facts about the dataset rather than an original work.
That being said, a binary executable, which can be copyrighted, is also the result of an automated process applied to source code. US copyright law explicitly defines a computer program as a set of statements or instructions to be used in a computer to bring about a certain result. While it would be difficult to fit model weights to this definition, we can imagine a future scenario in which model weights, or the topology they represent, are considered a work of authorship derived from the originality and curation of the source dataset.
There may come a day, in the not too distant future, where model weights can be covered by copyright and released under specific licenses.
Model Outputs
Depending on the model, the output can be copyrightable. However, the ownership of this copyright will generally belong to the user and not the entity that produced the model.
A model that generates bounding boxes for objects in an image is outputting information and, if the model is correct, possibly a fact. Since facts cannot be copyrighted, these outputs are likely not covered. Additionally, few entities would have any desire to copyright them.
The outputs of generative models are a much more interesting case. The generative model itself cannot hold copyright. That is to say, if the model is viewed as the author then the outputs are not covered by copyright.
On the other hand, if the model is viewed as a tool through which a human author, via prompting, produces an original work, then that output can be covered by copyright. This would treat the model in the same light as an image manipulation tool or word processor: a medium for human expression.
Under the current US copyright regime, I can see no case in which the developers of a model would own its output, except through a strict licensing agreement under which the user assigns them the copyright. Such a scheme would assume the model's outputs are copyrightable in the first place.
Circumvention
One might want to attach certain clauses to their intellectual property (dataset, weights, outputs) when releasing it. These could be a non-commercial clause, a requirement to disclose all sources of a model's training data, or a requirement that any derivative works be released openly under the same license.
This falls apart rapidly in the deep learning space. Let us consider a hypothetical dataset that has been built and released for non-commercial use online. A model trained on that dataset does not necessarily inherit the dataset’s non-commerciality. Further, the outputs of that model would also not inherit the dataset’s non-commerciality.
With this, there is a very clear path to laundering the dataset into a new one unencumbered by the non-commercial clause: develop a model and simply use it to label a new training set. Releasing that new training set under a permissive license successfully undermines the restrictive clauses of the original dataset.
This path works for the model weights as well. If the weights are restricted, you can generate outputs, compile them into a new dataset, and train entirely new weights. This may violate a license (such as Llama 2's), but it is unclear whether the consequences would amount to anything more than termination of the license agreement.
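The laundering chain described above can be sketched in a few lines. Everything here is a toy stand-in (the dataset, the "training" routine, and the dict-based model are all hypothetical placeholders, not a real training stack), meant only to show how restrictions fail to propagate from one artifact to the next:

```python
# A toy sketch of the laundering chain. The "model" is just a dict
# that memorizes the majority label per input; it stands in for weights.

def train(dataset):
    """'Train' a toy model: memorize the majority label for each input."""
    seen = {}
    for features, label in dataset:
        seen.setdefault(features, []).append(label)
    return {f: max(set(labels), key=labels.count) for f, labels in seen.items()}

def predict(model, features):
    return model.get(features, "unknown")

# 1. A dataset released under a hypothetical non-commercial license.
restricted_dataset = [("red", "stop"), ("green", "go"), ("red", "stop")]

# 2. Train a model on it. Under the consensus view, the resulting
#    weights (here, the dict) likely carry no copyright of their own.
model = train(restricted_dataset)

# 3. Use the model to label a fresh, unencumbered pool of inputs.
unlabeled_pool = ["red", "green"]
laundered_dataset = [(x, predict(model, x)) for x in unlabeled_pool]

# 4. Release the new dataset permissively and train new weights on it;
#    the original non-commercial clause never attached to these artifacts.
clean_model = train(laundered_dataset)
print(laundered_dataset)
```

The same loop applies when the weights, rather than the dataset, carry the restriction: step 3 turns restricted weights into outputs, and step 4 turns outputs into a new dataset and new weights.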
For the outputs, we have seen that even unlicensed copyrighted material can, in some cases, be compiled into datasets and used under fair use. Any restriction has the potential to be discarded in court under a fair use test. Further, many outputs will not be copyrightable at all, and for the rest it will likely be up to the user of the model to assign copyright terms.
We can see that in all cases, restrictive clauses are readily laundered away. It may not be feasible to do this for the largest models due to cost, but by chaining this process two or three times any entity would be hard pressed to prove an infringing use.
Two Paths
The Three Tier License
For a licensing scheme or any restrictive clause to matter, it would have to extend from dataset, to model weights, to model outputs, and back to datasets. Assuming, generously, that copyright can be asserted over each of these elements, let’s examine a possible licensing scheme.
If we released a dataset for non-commercial use, the license would have to extend the non-commerciality to the model weights as well as the outputs produced by those weights. For example:
Use of this dataset, model weights, and model outputs is granted to you for non-commercial purposes only.
This license shall apply to any model weights derived, in whole or in part, from a covered dataset.
You agree to assign this license to any outputs generated by covered model weights.
This license shall apply to any dataset that includes covered outputs.
A license like this would likely fall apart and be rendered unenforceable, whether through fair use of the outputs or through the non-copyrightability of model weights.
If it did hold, tracking licensing requirements across disparate data sources, communicating them to users, and enforcing their terms would either spawn a regulation-induced industry of immeasurable "value" or turn anything carrying these licenses into toxic waste.
Embracing Openness
The more natural path is to adopt the ethos of shanzhai and gongkai in deep learning research.
山寨
Shanzhai is almost a necessity. We cannot wait around for big tech or academia to meet our needs. They have proven what is possible and laid the foundation, but it falls on us to harness it and innovate. Apple will never include a cigarette lighter on the iPhone and the ad tech giants will never develop an IRL adblocker for augmented reality devices.
Use cases that go against corporate interests, markets too small for big players to serve, and subjects that are taboo in academia will not be developed without a rebellious spirit of innovation.
公开
Under current US copyright law it would be a challenge to restrict the positive feedback loops of progress in artificial intelligence. Attempts to lock everything down might destroy the very ecosystem that enables so much innovation.
We should leverage this situation by embracing gongkai, sharing openly with each other, freely remixing ideas, and collaborating to build the future. We have the opportunity to operate without the restrictions that traditionally gate innovation in the United States. If we simply follow this natural flow, society and technology can progress at unimaginable speeds.
We should extend the period we are in as long as we can to ensure that development and access to AI are democratized. This will be looked back upon as the golden age of deep learning. Go forth and innovate.