How to read raw binary without definition, and re-write to binary? #736

Closed
siebeneicher opened this issue Mar 30, 2017 · 5 comments
@siebeneicher

For a project I need to read a binary without having its proto definition. Google's protoc.exe prints something readable, but I also need to change specific content and then write the result back to binary.

Any general advice? Would I need to dive deep into the protocol to understand how to decode it manually?

Or would you suggest taking the protoc.exe output, transforming it to, let's say, JSON, and re-encoding it (with a somehow reverse-engineered proto)?

I am not necessarily tied to protobuf.js or any particular technology.

Any general advice is super welcome!

@dcodeIO
Member

dcodeIO commented Mar 30, 2017

You could reverse engineer the definition. It's not that hard actually and once this is done, you'd not be limited anymore.

Alternatively, there is the low level API for working with the wire format (example) that could also help you to identify the format.
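
As a rough illustration of what the writing side involves at the wire level, here is a minimal dependency-free sketch. It does not use the protobuf.js API; `encodeVarint`, `encodeLengthDelimited` and `encodeString` are hypothetical helper names. It rebuilds one field 2 sub-message holding the two strings that appear in the buffer posted further down in this thread. The point it tries to show is that every length prefix is recomputed from the payload, so changing a string automatically changes the enclosing lengths.

```javascript
// Hypothetical helpers for illustration only, not part of protobuf.js.

// Encode a non-negative integer (below 2^53) as a base-128 varint.
function encodeVarint(value) {
  const out = [];
  while (value > 0x7f) {
    out.push((value % 128) | 0x80); // low 7 bits, msb set: more bytes follow
    value = Math.floor(value / 128);
  }
  out.push(value); // last byte, msb clear
  return out;
}

// Encode a wire type 2 field: tag, length prefix, then the payload bytes.
function encodeLengthDelimited(field, payload) {
  return [
    ...encodeVarint(field * 8 + 2), // tag = (field << 3) | wireType
    ...encodeVarint(payload.length),
    ...payload,
  ];
}

// Strings are length-delimited UTF-8 bytes.
function encodeString(field, text) {
  return encodeLengthDelimited(field, new TextEncoder().encode(text));
}

// Field 2 is a sub-message that contains two string fields.
const pair = encodeLengthDelimited(2, [
  ...encodeString(1, "browse_id"),
  ...encodeString(2, "FEwhat_to_watch"),
]);

const hex = pair.map((b) => b.toString(16).padStart(2, "0")).join(" ");
console.log(hex);
// 12 1c 0a 09 62 72 6f 77 73 65 5f 69 64 12 0f 46 45 77 68 61 74 5f 74 6f 5f 77 61 74 63 68
```

If one of the strings is replaced with a longer one, the `1c` length byte changes with it, and so would the length prefixes of any messages wrapped around this one.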

@siebeneicher
Author

I find your example very interesting and I will continue down that road!

So far I am analyzing this first part of a buffer:

0a df 11 32 9e 05 08 02 12 1c 0a 09 62 72 6f 77 73 65 5f 69 64 12 0f 46 45 77 68 61 74 5f 74 6f 5f 77 61 74 63 68 ...

But I got stuck partway through. I hope you will bear with me.

From what I understood:

// 0a = 10dec = 0000 1010 = msb: 0, id: 1, wiretype: 2
// df = 223dec = 1101 1111 = msb: 1
// 11 = 17dec = 0001 0001 = msb: 0 concat: 11+df => 001 0001 + 101 1111 => 2271dec

My conclusion is: wire type 2 (ldelim) with a length of 2271 bytes.

So that's why I do:

	var reader = protobuf.Reader.create(rbuffer);
	while (reader.pos < reader.len) {
	    var tag = reader.uint64();		// get max. 8bytes, does take MSB in consideration, returns full tag

		// 1st bit 		= msb
		// 2-4th bit 	= id
		// 5-8th bit 	= wire msg

	    var id = tag >>> 3;						// shift 3 bits out (id = 4 bits)
	    var wireType = tag & 7;					// decimal of last 3 bits
	    console.log(tag, wireType);

	    switch (wireType) {
	        case 2:
	        	var l = reader.uint64();
	            console.log(reader.string());
	            break;
	        default:
	            //reader.skipType(/*wireType*/ tag & 7);
	            break;
	    }
	}

Here is the console.log output from reader.string():

"������
browse_id��FEwhat_to_watch��
�context��yt_"

which does not look correct.

Parsing the same buffer with protoc.exe --decode_raw < buffer returns:

1 {
6 {
1: 2
2 {
1: "browse_id"
2: "FEwhat_to_watch"
}
2 {
1: "context"
2: "yt_android_w2w"
}
2 {
1: "has_unlimited_entitlement"
2: "False"
}
....

So I expect I am missing something in the interpretation.

Is the string by chance nested, so that I have to apply the same process to the return value of string()?

How can I determine whether it's proto v2 or v3?

Very glad for any feedback from you!

Cheers,

Markus

@dcodeIO
Member

dcodeIO commented Mar 30, 2017

> So I expect I am missing something in the interpretation.

Looks like it's not just bytes, but submessages, so ...

> Is the string by chance nested, so that I have to apply the same process to the return value of string()?

Yep, but it's a buffer rather than a string: read it with .bytes()

0a	id 1, wireType 2
df	95 (msb set, so the varint continues)
11	17 (msb clear, last byte): (17 << 7) | 95 = 2271

either 2271 bytes of a string, of a buffer or a sub-message. let's assume a sub-message:

32	id 6, wireType 2
9e	30 (msb set, so the varint continues)
05	5 (msb clear, last byte): (5 << 7) | 30 = 670

looks like a sub-message (also corresponds to what protoc outputs: note the 6, which is the field id here).
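
The byte-by-byte reading above can be checked with a small dependency-free sketch. It does not use protobuf.js; `readVarint` and `readLengthDelimitedHeader` are hypothetical helper names, and the input is just the first six bytes of the buffer from this thread.

```javascript
// Hypothetical helpers for illustration only, not part of protobuf.js.

// Read one base-128 varint starting at `pos`; exact for values below 2^53.
function readVarint(bytes, pos) {
  let value = 0;
  let shift = 0;
  for (;;) {
    if (pos >= bytes.length) throw new RangeError("truncated varint");
    const b = bytes[pos++];
    value += (b & 0x7f) * 2 ** shift; // 7 payload bits, least significant group first
    shift += 7;
    if ((b & 0x80) === 0) break; // msb clear: this was the last byte
  }
  return { value, pos };
}

// Read a tag varint followed by the length varint of a wire type 2 field.
function readLengthDelimitedHeader(bytes, pos) {
  const tag = readVarint(bytes, pos);
  const len = readVarint(bytes, tag.pos);
  return {
    field: tag.value >>> 3, // everything above the low 3 bits
    wireType: tag.value & 7, // low 3 bits
    length: len.value,
    pos: len.pos, // offset where the payload starts
  };
}

const header = Uint8Array.from([0x0a, 0xdf, 0x11, 0x32, 0x9e, 0x05]);
const outer = readLengthDelimitedHeader(header, 0);
const inner = readLengthDelimitedHeader(header, outer.pos);

console.log(outer.field, outer.wireType, outer.length); // 1 2 2271
console.log(inner.field, inner.wireType, inner.length); // 6 2 670
```

Note that the tag and the length are two separate varints. Reading the length once and then calling a string reader that reads its own length prefix consumes two varints where the wire has only one.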

Going by the protoc output, this pattern continues. The message structure is roughly:

message {
  field 6 (submessage) {
    field 1 (varint or fixed),
    field 2 (submessage) {
      field 1 (string),
      field 2 (string)
    }
    field 2 ... again, hence: repeated
  }
}

etc. As you see, protoc's output is a good indicator of the field ids to expect. It also indirectly shows possible data types (strings, submessages with braces, but numbers could be varints or fixed32/64 bits).
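
The structure above can also be recovered without a definition by guessing, for each length-delimited field, whether its payload parses cleanly as a nested message. The following is a minimal sketch of that heuristic, assuming nothing about how protoc implements `--decode_raw`; `readVarint` and `decodeRaw` are hypothetical names, and the input is the complete fragment of the posted buffer that starts at byte `08` (the enclosing messages are truncated in the post, so they are left out).

```javascript
// Hypothetical helpers for illustration only, not part of protobuf.js.

function readVarint(bytes, pos) {
  let value = 0;
  let shift = 0;
  for (;;) {
    if (pos >= bytes.length) throw new RangeError("truncated varint");
    const b = bytes[pos++];
    value += (b & 0x7f) * 2 ** shift;
    shift += 7;
    if ((b & 0x80) === 0) break;
  }
  return { value, pos };
}

// Try to parse `bytes` as a message. Returns a list of fields,
// or null when the bytes do not form a clean sequence of fields.
function decodeRaw(bytes) {
  const fields = [];
  let pos = 0;
  try {
    while (pos < bytes.length) {
      const tag = readVarint(bytes, pos);
      pos = tag.pos;
      const field = tag.value >>> 3;
      const wireType = tag.value & 7;
      if (field === 0) return null; // field number 0 is never valid
      let value;
      if (wireType === 0) {
        const v = readVarint(bytes, pos);
        value = v.value;
        pos = v.pos;
      } else if (wireType === 1 || wireType === 5) {
        const size = wireType === 1 ? 8 : 4; // fixed64 or fixed32, kept as raw bytes
        if (pos + size > bytes.length) return null;
        value = bytes.subarray(pos, pos + size);
        pos += size;
      } else if (wireType === 2) {
        const len = readVarint(bytes, pos);
        const end = len.pos + len.value;
        if (end > bytes.length) return null;
        const chunk = bytes.subarray(len.pos, end);
        // Heuristic: prefer a nested message, otherwise fall back to text.
        const nested = chunk.length > 0 ? decodeRaw(chunk) : null;
        value = nested !== null ? nested : new TextDecoder().decode(chunk);
        pos = end;
      } else {
        return null; // groups (3, 4) and unknown wire types are not handled here
      }
      fields.push({ field, wireType, value });
    }
  } catch (e) {
    return null; // a truncated varint means this was not a message
  }
  return fields;
}

const sample = Uint8Array.from(
  "08 02 12 1c 0a 09 62 72 6f 77 73 65 5f 69 64 12 0f 46 45 77 68 61 74 5f 74 6f 5f 77 61 74 63 68".split(" "),
  (h) => parseInt(h, 16)
);
const decoded = decodeRaw(sample);
console.log(JSON.stringify(decoded, null, 2));
```

For this fragment the result is field 1 with varint value 2, followed by field 2 as a nested message holding the strings "browse_id" and "FEwhat_to_watch", which matches the protoc output. The guess can be wrong in general: a short string or byte blob may happen to parse as a valid message, so the result is a hint rather than a definition.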

> How can I determine whether it's proto v2 or v3?

You cannot. The proto3 wire format does not differ from proto2. The differences are that field declarations are all implicitly optional, plus language-level constructs like oneofs. When reverse-engineering, it's better to declare everything optional anyway, so it's safe to use proto3 here.

@siebeneicher
Author

That makes sense. I assume this buffer uses V3, because nested messages in V2 would have wiretype 3 or 4, no?

@dcodeIO
Member

dcodeIO commented Apr 11, 2017

> I assume this buffer uses V3, because nested messages in V2 would have wiretype 3 or 4, no?

No, wire types 3 and 4 are for legacy groups, a feature long deprecated in proto2 already. On the wire, proto2 and proto3 do not differ much. It's mostly language-level changes like all-optional fields and new data types, and those new types use backward-compatible encoding.
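
For reference, the six wire types can be listed in a small lookup. This is a sketch only: `WIRE_TYPES` and `describeTag` are hypothetical names, and the helper handles single-byte tags only (field numbers 1 to 15). The `0x0b` input is a made-up byte chosen to show what a group start would look like; it does not occur in the buffer from this thread.

```javascript
// Hypothetical lookup for illustration only, not part of protobuf.js.
const WIRE_TYPES = {
  0: "varint",
  1: "64-bit fixed",
  2: "length-delimited (string, bytes, sub-message, packed repeated)",
  3: "start group (deprecated)",
  4: "end group (deprecated)",
  5: "32-bit fixed",
};

// Split a single-byte tag into field number and wire type.
function describeTag(tagByte) {
  if (tagByte & 0x80) throw new RangeError("multi-byte tag, decode it as a varint first");
  const wireType = tagByte & 7;
  return {
    field: tagByte >>> 3,
    wireType,
    name: WIRE_TYPES[wireType] || "unknown",
  };
}

console.log(describeTag(0x0a)); // field 1, wire type 2
console.log(describeTag(0x32)); // field 6, wire type 2
console.log(describeTag(0x0b)); // field 1, wire type 3 (start group)
```

Nested messages in both proto2 and proto3 show up as wire type 2, which is why the tags `0a` and `32` in this buffer say nothing about the syntax version.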

@dcodeIO dcodeIO closed this as completed Jun 9, 2017
@konsumer konsumer mentioned this issue Sep 14, 2017