// This Source Code Form is subject to the terms of the Mozilla Public
// License, v. 2.0. If a copy of the MPL was not distributed with this
// file, You can obtain one at https://mozilla.org/MPL/2.0/.

/*! Apple code signing technical specifications

This document outlines how Apple code signing is implemented at a technical
level.

# High Level Overview

Mach-O binaries embed an optional binary blob containing code signing
metadata. This binary blob contains content digests of various aspects
of the binary (such as the executable code) as well as an optional
cryptographic signature which effectively attests to the digested
content of the binary.

At run-time, stored digests are used to help ensure file integrity.

The cryptographic signature is used to verify the digests haven't
been tampered with as well as to validate trust with the entity that
produced that signature.

See
<https://developer.apple.com/library/archive/technotes/tn2206/_index.html#//apple_ref/doc/uid/DTS40007919>
for an additional overview of how code signing works on Apple platforms.

# The Important Data Structures

Mach-O is the executable binary format used by Apple platforms. A
Mach-O binary contains, among other things, a series of named *segments*
holding arbitrary data and *load commands* instructing the loader how
to load/execute the binary.

Code signing data is embedded within the `__LINKEDIT` segment in a Mach-O
binary. An `LC_CODE_SIGNATURE` load command identifies the offset and size
of the code signing data within `__LINKEDIT`.
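As a rough illustration, the payload of `LC_CODE_SIGNATURE` is small enough
to decode by hand. The sketch below assumes the `linkedit_data_command`
layout from Apple's `mach-o/loader.h` (four 32-bit fields: `cmd`, `cmdsize`,
`dataoff`, `datasize`) and a little-endian binary. The
`CodeSignatureCommand` type and the `parse_code_signature_command` function
are hypothetical names used for illustration, not part of this crate.

```rust
/// Load command identifier for `LC_CODE_SIGNATURE` (from `mach-o/loader.h`).
const LC_CODE_SIGNATURE: u32 = 0x1d;

/// File offset and size of the signature data (hypothetical helper type).
#[derive(Debug, PartialEq)]
struct CodeSignatureCommand {
    data_offset: u32,
    data_size: u32,
}

/// Read a little-endian u32, returning `None` if out of bounds.
fn read_u32_le(data: &[u8], offset: usize) -> Option<u32> {
    let b = data.get(offset..offset.checked_add(4)?)?;
    Some(u32::from_le_bytes([b[0], b[1], b[2], b[3]]))
}

/// Decode one load command, assuming a little-endian Mach-O.
/// Returns `None` if the bytes are not a well-formed `LC_CODE_SIGNATURE`.
fn parse_code_signature_command(command: &[u8]) -> Option<CodeSignatureCommand> {
    let cmd = read_u32_le(command, 0)?;
    let cmd_size = read_u32_le(command, 4)?;
    // The command is always 16 bytes: four 32-bit fields.
    if cmd != LC_CODE_SIGNATURE || cmd_size != 16 {
        return None;
    }
    Some(CodeSignatureCommand {
        data_offset: read_u32_le(command, 8)?,
        data_size: read_u32_le(command, 12)?,
    })
}
```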

The code signing data within a `__LINKEDIT` segment is itself a collection
of sub-records. A *SuperBlob* header defines the signing data format, the
length of data to follow, and the number of sub-sections, or *Blobs*, within.
Each *Blob* occupies a defined *slot*. *Slots* are effectively well-known
pieces of signing data. These include a *Code Directory*, *Entitlements*,
and a *Signature*, among others. See the [crate::CodeSigningSlot]
enumeration for the known defined slots.
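The *SuperBlob* layout can be sketched as a small parser. This is a minimal
illustration, assuming the layout in Apple's open source `cs_blobs.h`: all
integers are big-endian, the header is a magic (`0xfade0cc0` for an embedded
signature), a total length, and a count, followed by that many
(slot type, offset) index entries, and every *Blob* begins with its own magic
and length. `BlobEntry` and `parse_super_blob` are hypothetical names and not
this crate's API.

```rust
/// Magic of an embedded signature SuperBlob (`CSMAGIC_EMBEDDED_SIGNATURE`).
const MAGIC_EMBEDDED_SIGNATURE: u32 = 0xfade_0cc0;

/// Location of one Blob inside the SuperBlob (hypothetical helper type).
#[derive(Debug, PartialEq)]
struct BlobEntry {
    slot: u32,
    magic: u32,
    offset: usize,
    length: usize,
}

/// Read a big-endian u32, returning `None` if out of bounds.
fn read_u32_be(data: &[u8], offset: usize) -> Option<u32> {
    let b = data.get(offset..offset.checked_add(4)?)?;
    Some(u32::from_be_bytes([b[0], b[1], b[2], b[3]]))
}

/// Walk the SuperBlob index and locate every Blob it references.
fn parse_super_blob(data: &[u8]) -> Option<Vec<BlobEntry>> {
    if read_u32_be(data, 0)? != MAGIC_EMBEDDED_SIGNATURE {
        return None;
    }
    let total_length = read_u32_be(data, 4)? as usize;
    let count = read_u32_be(data, 8)? as usize;
    // Ignore trailing bytes (e.g. padding) beyond the declared length.
    let data = data.get(..total_length)?;

    let mut entries = Vec::new();
    for i in 0..count {
        // Index entries follow the 12 byte header: (slot type, offset) pairs.
        let index = 12usize.checked_add(i.checked_mul(8)?)?;
        let slot = read_u32_be(data, index)?;
        let offset = read_u32_be(data, index.checked_add(4)?)? as usize;
        // Every Blob starts with its own magic and total length.
        let magic = read_u32_be(data, offset)?;
        let length = read_u32_be(data, offset.checked_add(4)?)? as usize;
        // Reject Blobs that would extend past the SuperBlob.
        data.get(offset..offset.checked_add(length)?)?;
        entries.push(BlobEntry { slot, magic, offset, length });
    }
    Some(entries)
}
```

Returning `None` on any out-of-bounds read keeps the sketch short. A real
parser would likely report which field was malformed.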

Each *Blob* contains its own header magic effectively identifying the
content type within and how bytes should be interpreted. The magic
values are independent of the *slot* type. However, there appears to be
a relationship between the two. For example, the code directory slot
will have header magic identifying the payload as a code directory structure.

The *Code Directory* blob/slot defines information about the binary
being signed. There are many fields to this data structure. But the most
important ones to understand are the hashes / content digests. The *Code
Directory* contains digests (e.g. SHA-256) of various content in the binary,
such as Mach-O segment data (i.e. the executable code) and other blobs/slots.

The *Entitlements* blob/slot contains a *plist*.

Additional file-based resources can also be signed. These are referred to as
*Code Resources*. *Code Resources* are captured in a
`_CodeSignature/CodeResources` XML plist file in the bundle and the digest
of this file is captured by the *Code Directory*. There is a defined
`RESOURCEDIR` slot to hold its digest. However, there is no explicit
magic constant for resources, implying that this data can only be provided
externally and not embedded within the *SuperBlob*.

The *Signature* blob/slot contains a `SignedData` structure as defined by
Cryptographic Message Syntax (CMS, RFC 5652), serialized as BER encoded
ASN.1. CMS is a specification for cryptographically signing arbitrary
content. The `SignedData` structure contains an additional set of *signed
attributes* (think of it as arbitrary extra content to sign), a cryptographic
signature of the signed data, and likely the X.509 certificate of the signer
and its chain of certificate signers.

# How Signing Works

Code signing logically consists of the following steps:

1. Collecting content that needs to be signed/attested/trusted.
2. Computing content digests.
3. Cryptographically signing a message derived from the content digests.
4. Adding signature data to the Mach-O binary.

## Collecting Content

Embedded code signatures support signing a myriad of data formats.
These include but aren't limited to:

* The Mach-O data outside the signature data in the `__LINKEDIT` segment.
* Requested entitlements for the binary.
* A code requirement statement / expression.
* Resource files.

If your binary is already part of a *bundle*, content collection can
occur automatically using heuristics. For example, the `Contents/Resources`
directory contains additional files whose content should be signed.

## Computing Content Digests

Once content has been assembled, a series of digests are computed.

For the code digests, the Mach-O segments are iterated. The raw segment
data is chunked into *pages* and each page is hashed separately. This allows
code data to be hashed lazily as each page is loaded by the kernel.
(Otherwise, megabytes of data would often have to be hashed at process
start, which would add overhead.)

Code hashes are a bit nuanced. Hashing restarts at each segment boundary,
i.e. hashes don't span multiple segments. The `__PAGEZERO` segment is
not hashed. The `__LINKEDIT` segment is hashed, but only up to the start
offset of the embedded signature data, if present.
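The page hashing rules above can be sketched as follows. This is a
simplified illustration under stated assumptions, not this crate's
implementation: `Segment` is a hypothetical stand-in for parsed Mach-O
segment data, `signature_start` is the offset of the signature within
`__LINKEDIT` (if any), and the digest function is passed in as a closure
because the Rust standard library does not provide SHA-256.

```rust
/// Hypothetical, simplified view of a Mach-O segment.
struct Segment<'a> {
    name: &'a str,
    data: &'a [u8],
}

/// Compute one digest per page, following the rules described above.
fn code_digests<D>(
    segments: &[Segment<'_>],
    signature_start: Option<usize>,
    page_size: usize,
    digest: D,
) -> Vec<Vec<u8>>
where
    D: Fn(&[u8]) -> Vec<u8>,
{
    assert!(page_size > 0, "page size must be non-zero");
    let mut digests = Vec::new();
    for segment in segments {
        let data = match (segment.name, signature_start) {
            // `__PAGEZERO` is never hashed.
            ("__PAGEZERO", _) => continue,
            // `__LINKEDIT` is hashed only up to the start of the signature.
            ("__LINKEDIT", Some(start)) => &segment.data[..start.min(segment.data.len())],
            _ => segment.data,
        };
        // Pages never span segments; the last page of a segment may be short.
        for page in data.chunks(page_size) {
            digests.push(digest(page));
        }
    }
    digests
}
```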

Other content (such as the entitlements, code requirement statement, and
resource files) is serialized to *Blob* data. The mechanism for this
varies by type. For example, the entitlements plist is embedded as UTF-8
data and the code requirement statement is serialized into an expression
tree. The resulting *Blob* is then digested.

The content digests are then assembled into a *Code Directory* data
structure. Digests of code data are referred to as *code slots* and
digests of other entities (namely *Blob* data) occupy *special slots*.
The *Code Directory* also contains other important information, such
as describing the hash/digest mechanism used, the page size for code
hashing, and executable limits for the binary.

The content of the *Code Directory* serialized to a *Blob* is then itself
digested. This value is known as the *code directory hash*.

## Cryptographic Signing

A cryptographic signature is produced using the Cryptographic Message
Syntax (CMS) signing mechanism.

From a high level, CMS takes as inputs:

* Optional content to sign.
* Optional set of additional attributes (effectively key-value data) to sign.
* A signing key.
* Information about the signing key (including its CA chain).

From these, CMS will produce a BER encoded ASN.1 blob containing the
cryptographic signature and sufficient metadata to verify it (such
as the signed attributes and information about the signing certificate).

In CMS speak, the *encapsulated content* being signed is omitted.
However, the `message-digest` signed attribute is the digest of the
*Code Directory* *Blob* data. (RFC 5652 allows the *encapsulated content*
to be omitted from the `SignedData` structure, which it calls an *external
signature*. Omitting the data is likely done to avoid redundant storage
of this data in the Mach-O binary and/or to simplify parsing, as *Code
Directory* data wouldn't be embedded within an ASN.1 stream.)

In addition, there is a signed attribute for the signing time. There is
also an XML plist defining an array of base64 encoded *Code Directory*
hashes. There are multiple *slots* in a *SuperBlob* for code directories
and the array in the signed XML plist appears to allow hashes of all of
them to be recorded.

(TODO it isn't clear what the signed content is when there are multiple
*Code Directory* slots in use. Presumably `message-digest` is computed
over all of them.)

Because *signed attributes* are present, CMS does not sign the *Code
Directory* data directly. Instead, the DER serialized ASN.1 structures
defining the *signed attributes* (which include the `message-digest` of
the *Code Directory*) become the *plaintext* message to be signed.

This *plaintext* message is combined with a private key and cryptographically
signed (likely using RSA). This produces a *signature*.

CMS then serializes the *signature*, *signed attributes*, signer
certificate info, and other important metadata to a BER encoded ASN.1
data structure. This raw slice of bytes is referred to as the
*embedded signature*.

## Adding Signature Data to Mach-O Binary

The above steps have already materialized several *Blob* data
structures. The individual pieces like the entitlements and code requirement
*Blobs* were materialized in order to compute their hashes for the *Code
Directory* data structure. And the *Code Directory* *Blob* was constructed
so it could be signed by CMS.

The *embedded signature* data produced by CMS is assembled into a *Blob*
structure. At this point, we have all the *Blobs* ready.

All the *Blobs* are assembled together into a *SuperBlob*. The
*SuperBlob* is then written to the `__LINKEDIT` segment of the
Mach-O binary. An appropriate `LC_CODE_SIGNATURE` load command is
also written to the Mach-O binary to instruct where the *SuperBlob*
data resides.

The `__LINKEDIT` segment is the last segment in the Mach-O binary and
the *SuperBlob* often occupies the final bytes of the `__LINKEDIT`
segment. So in many cases, adding code signature data to a Mach-O binary
only requires truncating the file to remove any existing signature and
then appending the new signature data to `__LINKEDIT`.

However, insertion or removal of `LC_CODE_SIGNATURE` will require
rewriting the entire file and adjusting offsets in various Mach-O
data structures accordingly. In many cases, an existing code signature
can be replaced by truncating the `__LINKEDIT` segment, writing the
replacement data, and updating sizes/offsets in-place in the segments
index and `LC_CODE_SIGNATURE` load command.

Note that there is a chicken-and-egg problem related to writing the
Mach-O binary and computing the digests of that binary for the *Code
Directory*! The *Code Directory* needs to compute a digest over the
content of the Mach-O file up until the signature data. But this needs
to be done before a CMS signature is produced, as we need to digest
the *Code Directory* for a CMS signed attribute. We also need to know
the size of the CMS signature, as it is part of the signature data
embedded in the Mach-O binary and its size needs to be recorded in
the `LC_CODE_SIGNATURE` load command and segment definitions, which
are hashed by the *Code Directory*. This is a circular dependency. A
trick for working around it is to pad the Mach-O signature data with
extra null bytes and record this padded length in `LC_CODE_SIGNATURE`
before code digests are computed. The *SuperBlob* parser appears to
be lenient about this solution. Further note that calculating the
exact final length before CMS signature generation may be impossible
due to the CMS signature being non-deterministic (due to the use of
signing times and time-stamp server tokens, which could be variable
length).
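The padding trick can be sketched in a few lines. This is a minimal
illustration, assuming the caller picked an over-estimate (`reserved_len`)
and recorded it in `LC_CODE_SIGNATURE` before hashing. The `pad_to_reserved`
function is a hypothetical name, and how to choose a safe estimate is left
open.

```rust
/// Pad a finished SuperBlob with zero bytes to fill a pre-reserved region.
///
/// `reserved_len` is the over-estimate that was recorded in
/// `LC_CODE_SIGNATURE` before the code digests were computed.
fn pad_to_reserved(super_blob: &[u8], reserved_len: usize) -> Result<Vec<u8>, String> {
    if super_blob.len() > reserved_len {
        // The estimate was too small: signing must restart with more room.
        return Err(format!(
            "signature is {} bytes but only {} were reserved",
            super_blob.len(),
            reserved_len
        ));
    }
    let mut padded = super_blob.to_vec();
    // The signature region is not covered by the code digests, so only its
    // recorded size has to stay fixed; the trailing bytes are zero filled.
    padded.resize(reserved_len, 0);
    Ok(padded)
}
```

If the estimate turns out to be too small, the only option in this sketch is
to reserve more space and recompute the digests, since the recorded size is
itself part of the hashed data.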

# How Bundle Signing Works

Signing bundles (e.g. `.app`, `.framework` directories) has its own
complexities beyond signing individual binaries.

Bundles consist of multiple files, perhaps multiple binaries. These files
can be classified as:

1. The main executable.
2. The `Info.plist` file.
3. Support/resource files.
4. Code signature files.

When signing bundles, the high-level process is the following:

1. Find and sign all nested binaries and bundles (bundles can contain
   other bundles) except the main binary and bundle.
2. Identify support/resource files and calculate their hashes, capturing
   this metadata in a `CodeResources` XML file.
3. Sign the main binary with an embedded reference to the digest of the
   `CodeResources` file.
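
The ordering requirement in the first step (nested items before the bundle
that contains them) can be sketched as a sort by path depth. This is only an
illustration of the ordering, assuming the nested bundles and binaries have
already been discovered. `signing_order` is a hypothetical helper and not
this crate's API.

```rust
use std::cmp::Reverse;
use std::path::PathBuf;

/// Order bundle/binary paths so that nested items come before their parents.
fn signing_order(mut paths: Vec<PathBuf>) -> Vec<PathBuf> {
    // A nested path always has more components than the bundle containing
    // it, so sorting by descending depth puts children first.
    paths.sort_by_key(|path| Reverse(path.components().count()));
    paths
}
```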

# How Verification Works

What happens when a binary is loaded? Read on to find out.

Please note that we don't know for sure everything that occurs when a binary
is loaded because the code is proprietary. We do have some high-level
documentation from Apple and we can empirically observe what occurs.
We can also infer what is happening based on the signing technical
implementation, assuming Apple follows correct practices. But some content
of this section is speculation and is merely what *likely* occurs.

When a Mach-O binary is loaded, the loader looks for an
`LC_CODE_SIGNATURE` load command. If not found, there is no embedded
signature data and running the binary may be rejected.

The associated code signature data is located in the `__LINKEDIT` segment
and parsed so its *Blobs* are discovered. How deeply it is parsed at this
stage, we don't know.

Data for the *Signature* slot/blob is obtained. This is the CMS *SignedData*
structure (BER encoded ASN.1). This structure is decoded and the cryptographic
signature, signed attributes, and X.509 certificates involved in the signing
are obtained from within.

We do not know the full extent of trust verification that occurs. But
Apple will examine details of the signing certificate and ensure its use
is allowed. For example, if the signing certificate wasn't issued/signed
by Apple or doesn't have the appropriate extensions present (such as bits
indicating the certificate is appropriate for code signing), it may refuse
to proceed. This trust validation likely occurs immediately after the
CMS data is parsed, as soon as the signing certificate information becomes
available for scrutiny.

The original *plaintext* message that was signed is assembled. This is
done by DER encoding the *signed attributes* from the CMS *SignedData*
structure.

This *plaintext* message, the signature of it, and the public key from the
signing certificate are all used to verify the cryptographic integrity
of the *signed attributes*. This effectively answers the question *did
something in possession of the private key for certificate X sign exactly
the signed attributes in this message?*

Successful signature verification ensures that the *signed attributes*
haven't been tampered with since they were signed.

The CMS data may also contain *unsigned attributes*. There may be
a *time stamp token* here containing a signature of the time when the
signed message was produced. This may be validated as well.

One of the signed attributes is `message-digest`. In this use of CMS,
`message-digest` is the digest of the *Code Directory* *Blob* data. This
digest is possibly verified: we don't know for sure. According to RFC 5652
it should be verified. However, it may not need to be because the digest
of the *Code Directory* data is stored elsewhere...

A signed attribute contains an XML plist containing an array of base64 encoded
hashes of *Code Directory* *Blobs*. This plist is likely parsed and the hashes
within are compared to the hashes from the *Code Directory* blobs/slots from
the *SuperBlob* record. If the digests are identical, it means that the *Code
Directory* data structures in the Mach-O binary haven't been modified since the
signature was created.

The *Code Directory* data structures contain digests of code data and
other *Blob* data from the *SuperBlob*. Since the digest of the *Code Directory*
data was verified via CMS and a trust relationship was (presumably) established
with the signer of that CMS data, verification and trust are transitively applied
to the other *Blob* data and code data (this is effectively a Merkle Tree).
This means that we can digest other *Blob* entries and code data and compare to
the digests within the *Code Directory* structures. If the digests are identical,
content hasn't changed since the signature was made.
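The final comparison step can be sketched as follows. This is a simplified
illustration and not a description of what Apple's loader does: it assumes
the digests recorded in an already-trusted *Code Directory* and the raw
*Blob* bytes are both available as maps keyed by slot number. The
`mismatched_slots` function is a hypothetical name, and the digest function
is passed in as a closure because the standard library has no SHA-256.

```rust
use std::collections::HashMap;

/// Return the slots whose Blob no longer matches the digest recorded in the
/// Code Directory, in ascending order. An empty result means all matched.
fn mismatched_slots<D>(
    recorded: &HashMap<u32, Vec<u8>>,
    blobs: &HashMap<u32, Vec<u8>>,
    digest: D,
) -> Vec<u32>
where
    D: Fn(&[u8]) -> Vec<u8>,
{
    let mut mismatched = Vec::new();
    for (slot, expected) in recorded {
        // A missing Blob is treated as a mismatch in this sketch.
        let matches = blobs
            .get(slot)
            .map_or(false, |blob| digest(blob.as_slice()) == *expected);
        if !matches {
            mismatched.push(*slot);
        }
    }
    mismatched.sort_unstable();
    mismatched
}
```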

It is unclear in what order other *Blob* data is read. But presumably important
data like the embedded entitlements and code requirement statement are read very
early during binary loading so an appropriate trust policy can be applied to
the binary.
*/