Introduction To Protocol Buffers Interface Definition Language - II

This guide describes how to use the protocol buffer language to structure your protocol buffer schema


This guide describes how to use the protocol buffer language to structure your protocol buffer data, including .proto file syntax and how to generate data access classes from your .proto files. It covers the proto3 version of the protocol buffer's language.

In the previous tutorial, we looked at how to define a message type, Scalar, and Enums. In this tutorial we will look into the Unkown files are working with packages, JSON etc.

So let's get started with defining our schema using the unknown fields!
. . .

Unknown Fields

Unknown fields are well-formed protocol buffer serialized data representing fields that the parser does not recognize. For example, when an old binary parses data sent by a new binary with new fields, those new fields become unknown fields in the old binary.

Originally, proto3 messages always discarded unknown fields during parsing, but in version 3.5 we reintroduced the preservation of unknown fields to match the proto2 behavior. In versions 3.5 and later, unknown fields are retained during parsing and included in the serialized output.
. . .

Any

The Any message type lets you use messages as embedded types without having their .proto definition. An Any contains an arbitrary serialized message as bytes, along with a URL that acts as a globally unique identifier for and resolves to that message's type. To use the Any type, you need to import google/protobuf/any.proto.
import "google/protobuf/any.proto"; message ErrorStatus {   string message = 1;   repeated google.protobuf.Any details = 2; }
. . .

Oneof

If you have a message with many fields and where at most one field will be set at the same time, you can enforce this behavior and save memory by using the oneof feature.

Oneof fields are like regular fields except all the fields in a oneof share memory, and at most one field can be set at the same time. Setting any member of the oneof automatically clears all the other members. You can check which value in a oneof is set (if any) using a special case() or WhichOneof() method, depending on your chosen language.

1. Using Oneof

To define a oneof in your .proto you use the oneof keyword followed by your oneof name, in this case test_oneof:
message SampleMessage {   oneof test_oneof {     string name = 4;     SubMessage sub_message = 9;   } }

You then add your oneof fields to the oneof definition. You can add fields of any type, except map fields and repeated fields.

In your generated code, oneof fields have the same getters and setters as regular fields. You also get a special method for checking which value (if any) in the oneof is set.
. . .

Maps

If you want to create an associative map as part of your data definition, protocol buffers provide a handy shortcut syntax:
map<key_type, value_type> map_field = N;

...where the key_type can be any integral or string type (so, any scalar type except for floating point types and bytes). Note that enum is not a valid key_type. The value_type can be any type except another map.

So, for example, if you wanted to create a map of projects where each Project message is associated with a string key, you could define it like this:
map<string, Project> projects = 3;
  • Map fields cannot be repeated.
  • Wire format ordering and map iteration ordering of map values is undefined, so you cannot rely on your map items being in a particular order.
  • When generating text format for a .proto, maps are sorted by key. Numeric keys are sorted numerically.
  • When parsing from the wire or when merging, if there are duplicate map keys the last key seen is used. When parsing a map from text format, parsing may fail if there are duplicate keys.
  • If you provide a key but no value for a map field, the behavior when the field is serialized is language-dependent. In C++, Java, and Python the default value for the type is serialized, while in other languages nothing is serialized.

Backward compatibility: The map syntax is equivalent to the following on the wire, so protocol buffers implementations that do not support maps can still handle your data:
message MapFieldEntry {   key_type key = 1;   value_type value = 2; } repeated MapFieldEntry map_field = N;

Any protocol buffers implementation that supports maps must both produce and accept data that can be accepted by the above definition.
. . .

Packages

You can add an optional package specifier to a .proto file to prevent name clashes between protocol message types.
package foo.bar; message Open { ... }

You can then use the package specifier when defining fields of your message type:
message Foo {   ...   foo.bar.Open open = 1;   ... }

The way a package specifier affects the generated code depends on your chosen language:
  • In C++ the generated classes are wrapped inside a C++ namespace. For example, Open would be in the namespace foo::bar.
  • In Java, the package is used as the Java package, unless you explicitly provide an option java_package in your .proto file.
  • In Python, the package directive is ignored, since Python modules are organized according to their location in the file system.
  • In Go, the package is used as the Go package name, unless you explicitly provide an option go_package in your .proto file.
  • In Ruby, the generated classes are wrapped inside nested Ruby namespaces, converted to the required Ruby capitalization style (first letter capitalized; if the first character is not a letter, PB_ is prepended). For example, Open would be in the namespace Foo::Bar.
  • In C# the package is used as the namespace after converting to PascalCase, unless you explicitly provide an option csharp_namespace in your .proto file. For example, Open would be in the namespace Foo.Bar.

Packages and Name Resolution
Type name resolution in the protocol buffer language works like C++: first the innermost scope is searched, then the next-innermost, and so on, with each package considered to be "inner" to its parent package. A leading '.' (for example, .foo.bar.Baz) means to start from the outermost scope instead.

The protocol buffer compiler resolves all type names by parsing the imported .proto files. The code generator for each language knows how to refer to each type in that language, even if it has different scoping rules.
. . .

Defining Services

If you want to use your message types with an RPC (Remote Procedure Call) system, you can define an RPC service interface in a .proto file and the protocol buffer compiler will generate service interface code and stubs in your chosen language.

So, for example, if you want to define an RPC service with a method that takes your SearchRequest and returns a SearchResponse, you can define it in your .proto file as follows:
service SearchService {   rpc Search(SearchRequest) returns (SearchResponse); }

The most straightforward RPC system to use with protocol buffers is gRPC: a language- and platform-neutral open source RPC system developed at Google. gRPC works particularly well with protocol buffers and lets you generate the relevant RPC code directly from your .proto files using a special protocol buffer compiler plugin.
. . .

JSON Mapping

Proto3 supports a canonical encoding in JSON, making it easier to share data between systems. The encoding is described on a type-by-type basis in the table below.

If a value is missing in the JSON-encoded data or if its value is null, it will be interpreted as the appropriate default value when parsed into a protocol buffer. If a field has the default value in the protocol buffer, it will be omitted in the JSON-encoded data by default to save space. An implementation may provide options to emit fields with default values in the JSON-encoded output.



. . .

Options

Individual declarations in a .proto file can be annotated with a number of options. Options do not change the overall meaning of a declaration, but may affect the way it is handled in a particular context. The complete list of available options is defined in google/protobuf/descriptor.proto.

Some options are file-level options, meaning they should be written at the top-level scope, not inside any message, enum, or service definition. Some options are message-level options, meaning they should be written inside message definitions. Some options are field-level options, meaning they should be written inside field definitions. Options can also be written on enum types, enum values, oneof fields, service types, and service methods; however, no useful options currently exist for any of these.

Here are a few of the most commonly used options:
  • java_package (file option): The package you want to use for your generated Java classes. If no explicit java_package option is given in the .proto file, then by default the proto package (specified using the "package" keyword in the .proto file) will be used. However, proto packages generally do not make good Java packages since proto packages are not expected to start with reverse domain names. If not generating Java code, this option has no effect.
option java_package = "com.example.foo";
  • java_multiple_files (file option): Causes top-level messages, enums, and services to be defined at the package level, rather than inside an outer class named after the .proto file.
option java_multiple_files = true;
  • java_outer_classname (file option): The class name for the outermost Java class (and hence the file name) you want to generate. If no explicit java_outer_classname is specified in the .proto file, the class name will be constructed by converting the .proto file name to camel-case (so foo_bar.proto becomes FooBar.java). If not generating Java code, this option has no effect.
option java_outer_classname = "Ponycopter";
  • optimize_for (file option): Can be set to SPEED, CODE_SIZE, or LITE_RUNTIME. This affects the C++ and Java code generators (and possibly third-party generators) in the following ways:
  • SPEED (default): The protocol buffer compiler will generate code for serializing, parsing, and performing other common operations on your message types. This code is highly optimized.
  • CODE_SIZE: The protocol buffer compiler will generate minimal classes and will rely on shared, reflection-based code to implement serialialization, parsing, and various other operations. The generated code will thus be much smaller than with SPEED, but operations will be slower.
  • LITE_RUNTIME: The protocol buffer compiler will generate classes that depend only on the "lite" runtime library (libprotobuf-lite instead of libprotobuf). The lite runtime is much smaller than the full library (around an order of magnitude smaller) but omits certain features like descriptors and reflection. This is particularly useful for apps running on constrained platforms like mobile phones. The compiler will still generate fast implementations of all methods as it does in SPEED mode. Generated classes will only implement the MessageLite interface in each language, which provides only a subset of the methods of the full Message interface.
option optimize_for = CODE_SIZE;
  • cc_enable_arenas (file option): Enables arena allocation for C++ generated code.
  • objc_class_prefix (file option): Sets the Objective-C class prefix which is prepended to all Objective-C generated classes and enums from this .proto. There is no default. You should use prefixes that are between 3-5 uppercase characters as recommended by Apple. Note that all 2 letter prefixes are reserved by Apple.
  • deprecated (field option): If set to true, indicates that the field is deprecated and should not be used by new code. In most languages this has no actual effect. In Java, this becomes a @Deprecated annotation. In the future, other language-specific code generators may generate deprecation annotations on the field's accessors, which will in turn cause a warning to be emitted when compiling code which attempts to use the field. If the field is not used by anyone and you want to prevent new users from using it, consider replacing the field declaration with a reserved statement.
int32 old_field = 6 [deprecated = true];
. . .

Generating Your Classes

To generate the Java, Python, C++, Go, Ruby, Objective-C, or C# code you need to work with the message types defined in a .proto file, you need to run the protocol buffer compiler protoc on the .proto.

If you haven't installed the compiler, download the package and follow the instructions in the README. For Go, you also need to install a special code generator plugin for the compiler: you can find this and installation instructions in the golang/protobuf repository on GitHub.

The Protocol Compiler is invoked as follows:
protoc --proto_path=_IMPORT_PATH_ --cpp_out=_DST_DIR_ --java_out=_DST_DIR_ --python_out=_DST_DIR_ --go_out=_DST_DIR_ --ruby_out=_DST_DIR_ --objc_out=_DST_DIR_ --csharp_out=_DST_DIR_ _path/to/file_.proto

  • IMPORT_PATH specifies a directory in which to look for .proto files when resolving import directives. If omitted, the current directory is used. Multiple import directories can be specified by passing the --proto_path option multiple times; they will be searched in order. -I=_IMPORT_PATH_ can be used as a short form of --proto_path.
  • You can provide one or more output directives:
  • --cpp_out generates C++ code in DST_DIR.
  • --java_out generates Java code in DST_DIR.
  • --python_out generates Python code in DST_DIR.
  • --go_out generates Go code in DST_DIR.
  • --ruby_out generates Ruby code in DST_DIR.
  • --objc_out generates Objective-C code in DST_DIR.
  • --csharp_out generates C# code in DST_DIR.
  • --php_out generates PHP code in DST_DIR.

You must provide one or more .proto files as input. Multiple .proto files can be specified at once. Although the files are named relative to the current directory, each file must reside in one of the IMPORT_PATHs so that the compiler can determine its canonical name.
. . .
In the next tutorial, we will learn how to use Protocol Compiler to generate the class definitions for languages such as Python, C++ and Golang etc.

On a mission to build Next-Gen Community Platform for Developers