Robust and Undetectable White-Box Watermarks for Deep Neural Networks
Training deep neural networks (DNN) is expensive in terms of computational power and the amount of necessary labeled training data. Thus, deep learning models constitute business value for data owners. Watermarking of deep neural networks can enable their tracing once released by a data owner. In this paper we define and formalize white-box watermarking algorithms for DNNs, where the data owner needs white-box access to the model to extract the watermark. White-box watermarking algorithms have the advantage that they do not impact the accuracy of the watermarked model. We demonstrate a new property inference attack using a DNN that can detect watermarking by any existing, manually designed algorithms regardless of training dataset and model architecture. We then propose the first white-box DNN watermarking algorithm that is undetectable by the property inference attack. We further extend the capacity and robustness of the watermark. Unlike prior watermarking schemes which restrict the content of watermark message to short binary strings, our new scheme largely increase the capacity and flexibility of the embedded watermark message. Experiments show that our new white-box watermarking algorithm does not impact accuracy, is undetectable and robust against moderate model transformation attacks.
READ FULL TEXT