Introduction To Protocol Buffers Interface Definition Language - I

This guide describes how to use the protocol buffer language to structure your protocol buffer schema


This guide describes how to use the protocol buffer language to structure your protocol buffer data, including .proto file syntax and how to generate data access classes from your .proto files. It covers the proto3 version of the protocol buffer's language.

So let's get started with defining our first schema!
. . .

Defining A Message Type

First, let's look at a very simple example. Let's say you want to define a search request message format, where each search request has a query string, the particular page of results you are interested in, and a number of results per page. Here's the .proto file you use to define the message type.
syntax = "proto3"; message SearchRequest {   string query = 1;   int32 page_number = 2;   int32 result_per_page = 3; }
  • The first line of the file specifies that you're using proto3 syntax: if you don't do this the protocol buffer compiler will assume you are using proto2. This must be the first non-empty, non-comment line of the file.
  • The SearchRequest message definition specifies three fields (name/value pairs), one for each piece of data that you want to include in this type of message. Each field has a name and a type.

1. Specifying Field Types
In the above example, all the fields are scalar types: two integers (page_number and result_per_page) and a string (query). However, you can also specify composite types for your fields, including enumerations and other message types.

2. Assigning Field Numbers
As you can see, each field in the message definition has a unique number. These field numbers are used to identify your fields in the message binary format, and should not be changed once your message type is in use.
Note that field numbers in the range 1 through 15 take one byte to encode, field numbers in the range 16 through 2047 take two bytes.

So you should reserve the numbers 1 through 15 for very frequently occurring message elements. Remember to leave some room for frequently occurring elements that might be added in the future.

3. Specifying Field Rules
Message fields can be one of the following:
  • singular: a well-formed message can have zero or one of this field (but not more than one). And this is the default field rule for proto3 syntax.
  • repeated: this field can be repeated any number of times (including zero) in a well-formed message. The order of the repeated values will be preserved.

In proto3, repeated fields of scalar numeric types use packed encoding by default.

4. Adding Comments
To add comments to your .proto files, use C/C++-style // and /* ... */ syntax.
/* SearchRequest represents a search query, with pagination options to  * indicate which results to include in the response. */ message SearchRequest {   string query = 1;   int32 page_number = 2;  // Which page number do we want?   int32 result_per_page = 3;  // Number of results to return per page. }

5. Reserved Fields
If you update a message type by entirely removing a field, or commenting it out, future users can reuse the field number when making their own updates to the type. This can cause severe issues if they later load old versions of the same .proto, including data corruption, privacy bugs, and so on.

One way to make sure this doesn't happen is to specify that the field numbers (and/or names, which can also cause issues for JSON serialization) of your deleted fields are reserved. The protocol buffer compiler will complain if any future users try to use these field identifiers.
message Foo {   reserved 2, 15, 9 to 11;   reserved "foo", "bar"; }

Note that you can't mix field names and field numbers in the same reserved statement.
. . .

Scalar Value Types

A scalar message field can have one of the following types – the table shows the type specified in the .proto file, and the corresponding type in the automatically generated class:



. . .

Optional Fields And Default Values

As mentioned above, elements in a message description can be labeled optional. A well-formed message may or may not contain an optional element. When a message is parsed, if it does not contain an optional element, the corresponding field in the parsed object is set to the default value for that field.

The default value can be specified as part of the message description. For example, let's say you want to provide a default value of 10 for a SearchRequest's result_per_page value.
optional int32 result_per_page = 3 [default = 10];

If the default value is not specified for an optional element, a type-specific default value is used instead: for strings, the default value is the empty string. For bytes, the default value is the empty byte string. For bools, the default value is false.

For numeric types, the default value is zero. For enums, the default value is the first value listed in the enum's type definition. This means care must be taken when adding a value to the beginning of an enum value list.
. . .

Enumerations

When you're defining a message type, you might want one of its fields to only have one of a pre-defined list of values. For example, let's say you want to add a corpus field for each SearchRequest, where the corpus can be UNIVERSAL, WEB, IMAGES, LOCAL, NEWS, PRODUCTS or VIDEO.

You can do this very simply by adding an enum to your message definition - a field with an enum type can only have one of a specified set of constants as its value (if you try to provide a different value, the parser will treat it like an unknown field). In the following example we've added an enum called Corpus with all the possible values, and a field of type Corpus:
message SearchRequest {   required string query = 1;   optional int32 page_number = 2;   optional int32 result_per_page = 3 [default = 10];   enum Corpus {     UNIVERSAL = 0;     WEB = 1;     IMAGES = 2;     LOCAL = 3;     NEWS = 4;     PRODUCTS = 5;     VIDEO = 6;   }   optional Corpus corpus = 4 [default = UNIVERSAL]; }

You can define aliases by assigning the same value to different enum constants. To do this you need to set the allow_alias option to true, otherwise protocol compiler will generate an error message when aliases are found.
enum EnumAllowingAlias {   option allow_alias = true;   UNKNOWN = 0;   STARTED = 1;   RUNNING = 1; } enum EnumNotAllowingAlias {   UNKNOWN = 0;   STARTED = 1;   // RUNNING = 1;  // Uncommenting this line will cause a compile error inside Google and a warning message outside. }

Enumerator constants must be in the range of a 32-bit integer. You can define enums within a message definition, as in the above example, or outside – these enums can be reused in any message definition in your .proto file. You can also use an enum type declared in one message as the type of a field in a different message, using the syntax _MessageType_._EnumType_.

When you run the protocol buffer compiler on a .proto that uses an enum, the generated code will have a corresponding enum for Java or C++, or a special EnumDescriptor class for Python that's used to create a set of symbolic constants with integer values in the runtime-generated class
. . .

Using Other Message Types

You can use other message types as field types. For example, let's say you wanted to include Result messages in each SearchResponse message – to do this, you can define a Result message type in the same .proto and then specify a field of type Result in SearchResponse:
message SearchResponse {   repeated Result results = 1; } message Result {   string url = 1;   string title = 2;   repeated string snippets = 3; }

1. Importing Definitions
In the above example, the Result message type is defined in the same file as SearchResponse – what if the message type you want to use as a field type is already defined in another .proto file?

You can use definitions from other .proto files by importing them. To import another .proto's definitions, you add an import statement to the top of your file:
import "myproject/other_protos.proto";

By default, you can only use definitions from directly imported .proto files. However, sometimes you may need to move a .proto file to a new location.

The protocol compiler searches for imported files in a set of directories specified on the protocol compiler command line using the -I/--proto_path flag. If no flag was given, it looks in the directory in which the compiler was invoked. In general you should set the --proto_path flag to the root of your project and use fully qualified names for all imports.
. . .

Nested Types

You can define and use message types inside other message types, as in the following example – here the Result message is defined inside the SearchResponse message:
message SearchResponse {   message Result {     string url = 1;     string title = 2;     repeated string snippets = 3;   }   repeated Result results = 1; }

If you want to reuse this message type outside its parent message type, you refer to it as _Parent_._Type_:
message SomeOtherMessage {   SearchResponse.Result result = 1; }

You can nest messages as deeply as you like:
message Outer {                  // Level 0   message MiddleAA {  // Level 1     message Inner {   // Level 2       int64 ival = 1;       bool  booly = 2;     }   }   message MiddleBB {  // Level 1     message Inner {   // Level 2       int32 ival = 1;       bool  booly = 2;     }   } }
. . .

Updating A Message Type

If an existing message type no longer meets all your needs – for example, you'd like the message format to have an extra field – but you'd still like to use code created with the old format, don't worry! It's very simple to update message types without breaking any of your existing code. Just remember the following rules:
  • Don't change the field numbers for any existing fields.
  • If you add new fields, any messages serialized by code using your "old" message format can still be parsed by your new generated code. You should keep in mind the default values for these elements so that new code can properly interact with messages generated by old code. Similarly, messages created by your new code can be parsed by your old code: old binaries simply ignore the new field when parsing. See the Unknown Fields section for details.
  • Fields can be removed, as long as the field number is not used again in your updated message type. You may want to rename the field instead, perhaps adding the prefix "OBSOLETE_", or make the field number reserved, so that future users of your .proto can't accidentally reuse the number.
  • int32, uint32, int64, uint64, and bool are all compatible – this means you can change a field from one of these types to another without breaking forwards- or backward-compatibility. If a number is parsed from the wire which doesn't fit in the corresponding type, you will get the same effect as if you had cast the number to that type in C++ (e.g. if a 64-bit number is read as an int32, it will be truncated to 32 bits).
  • sint32 and sint64 are compatible with each other but are not compatible with the other integer types.
  • string and bytes are compatible as long as the bytes are valid UTF-8.
  • Embedded messages are compatible with bytes if the bytes contain an encoded version of the message.
  • fixed32 is compatible with sfixed32, and fixed64 with sfixed64.
  • For string, bytes, and message fields, optional is compatible with repeated. Given serialized data of a repeated field as input, clients that expect this field to be optional will take the last input value if it's a primitive type field or merge all input elements if it's a message type field. Note that this is not generally safe for numeric types, including bools and enums. Repeated fields of numeric types can be serialized in the packed format, which will not be parsed correctly when an optional field is expected.
  • enum is compatible with int32, uint32, int64, and uint64 in terms of wire format (note that values will be truncated if they don't fit). However be aware that client code may treat them differently when the message is deserialized: for example, unrecognized proto3 enum types will be preserved in the message, but how this is represented when the message is deserialized is language-dependent. Int fields always just preserve their value.
  • Changing a single value into a member of a new oneof is safe and binary compatible. Moving multiple fields into a new oneof may be safe if you are sure that no code sets more than one at a time. Moving any fields into an existing oneof is not safe.
. . .
In the next tutorial, we will look into the unknown fields and how to use them in Schema definition

On a mission to build Next-Gen Community Platform for Developers