## Lanark
A specification for restricted dotted names.
## Motivation
Applications such as compilers and package systems often use so-called
[reverse DNS notation](https://en.wikipedia.org/wiki/Reverse_domain_name_notation)
to identify packages and code artifacts. Unfortunately, due to reverse DNS
notation being underspecified, each implementation has its own idea of which
names should be permitted and which should be rejected.
This specification attempts to define a restricted form of the notation with the
following properties:
* Names can be validated with a simple regular expression that is defined
in such a way as to avoid resource exhaustion attacks.
* Names are defined using a strict subset of [ASCII](https://en.wikipedia.org/wiki/ASCII)
in order to avoid Unicode-based spoofing and phishing attacks.
* Names are defined such that the maximum length of a name is bounded
in order to provide for predictable storage use in database applications.
## Definitions
### Regular Expression
A _dotted name_ is a string matching the following regular expression:
```
([a-z][a-z0-9_-]{0,63})(\.[a-z][a-z0-9_-]{0,62}){0,15}
```
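As an illustration only, the expression can be applied in Python as shown below. The function name `is_dotted_name` is hypothetical and not part of this specification. Because the expression contains no anchors, the sketch uses `re.fullmatch` so that the entire string must match rather than just a prefix.

```python
import re

# The canonical expression from this specification, pasted unchanged.
DOTTED_NAME = re.compile(
    r"([a-z][a-z0-9_-]{0,63})(\.[a-z][a-z0-9_-]{0,62}){0,15}"
)


def is_dotted_name(text):
    """Return True if the whole of `text` is a dotted name (hypothetical helper)."""
    # fullmatch requires the entire string to match, so no ^ or $ is needed.
    return DOTTED_NAME.fullmatch(text) is not None


if __name__ == "__main__":
    for candidate in ["com.io7m.lanark", "a", "Com.io7m", "com..io7m", "com.7m", ""]:
        print(repr(candidate), is_dotted_name(candidate))
```

Environments that search for a match anywhere in the input, rather than matching the whole input, would need equivalent anchoring.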
#### Theorem SPLIT
A _dotted name_ can always be split into between 1 and 16 _segments_
by splitting the name into separate parts at each dot, such that each part is
a valid _dotted name_.
#### Proof
Let `x` be a _dotted name_. By the definition of the
[regular expression](#regular-expression) that defines a dotted name, `x`
must be one of:
* A single character in the range `[a-z]`.
* A _primary segment_ consisting of a single character in the range `[a-z]`
followed by up to 63 characters from the set `[a-z0-9_-]`.
* A _primary segment_ followed by between `1` and `15` _secondary segments_
that each consist of a dot `.`, followed by a single character in the range
`[a-z]`, followed by up to 62 characters from the set `[a-z0-9_-]`.
If `x` is a single character in the range `[a-z]`, then there is no splitting
to be performed and `x` is already trivially matched by the regular expression
and is therefore a valid dotted name.
If `x` consists of a single _primary segment_, then there is no splitting
to be performed and `x` is already trivially matched by the regular expression
and is therefore a valid dotted name.
If `x` consists of a _primary segment_ followed by between `1` and `15`
_secondary segments_, then for each segment `s` it is necessary to show that
`s` matches the regular expression when the preceding dot (if `s` is a
_secondary segment_) is removed.
* If `s` is a _primary segment_, then it already matches the regular
expression.
* If `s` is a _secondary segment_, then it will effectively become the
_primary segment_ of the new dotted name. Because `s` is
a _secondary segment_, it must match the subexpression
`[a-z][a-z0-9_-]{0,62}`. By the semantics of length ranges in
regular expressions, any string matched by an expression `e{0,n}` will
also be matched by an expression `e{0,n+1}`. As the
subexpression for _primary segments_ is `[a-z][a-z0-9_-]{0,63}`, `s`
will match and is therefore a valid _primary segment_.
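The SPLIT property can also be checked informally on individual names. The following is a minimal sketch, not a substitute for the proof; the helper names `is_dotted_name` and `split_segments` are hypothetical.

```python
import re

# The canonical expression from this specification, pasted unchanged.
DOTTED_NAME = re.compile(
    r"([a-z][a-z0-9_-]{0,63})(\.[a-z][a-z0-9_-]{0,62}){0,15}"
)


def is_dotted_name(text):
    """Return True if the whole of `text` is a dotted name (hypothetical helper)."""
    return DOTTED_NAME.fullmatch(text) is not None


def split_segments(name):
    """Split a valid dotted name at each dot (hypothetical helper)."""
    if not is_dotted_name(name):
        raise ValueError("not a dotted name: %r" % name)
    return name.split(".")


if __name__ == "__main__":
    parts = split_segments("com.io7m.lanark.core")
    # Each part should itself be a dotted name, and there should be 1 to 16 parts.
    print(parts, all(is_dotted_name(p) for p in parts), 1 <= len(parts) <= 16)
```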
#### Theorem SIZE
The number of characters in any _dotted name_ is `<= 1024`.
#### Proof
By [SPLIT](#theorem-split), we know that a dotted name `x` consists of
a _primary segment_ followed by up to `15` _secondary segments_.
The regular (sub)expression that matches a _primary segment_ is
defined as `[a-z][a-z0-9_-]{0,63}`. The maximum length of a _primary segment_
is therefore `1 + 63 = 64`.
The regular (sub)expression that matches a _secondary segment_ is
defined as `\.[a-z][a-z0-9_-]{0,62}`. The maximum length of a _secondary segment_
is therefore `1 + 1 + 62 = 64`.
The regular (sub)expression that defines how many _secondary segments_ may
appear in a _dotted name_ is defined as `e{0,15}`, so the maximum number of
_secondary segments_ is `15` and therefore the maximum number of characters
that can be used for _secondary segments_ is `15 * 64 = 960`.
We can therefore conclude that a string consisting of a maximum length
_primary segment_ followed by the maximum number of maximum length
_secondary segments_ has a length of `64 + (15 * 64) = 1024` characters, and
that no _dotted name_ can be longer.
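The bound can be exercised by constructing a longest possible name and checking it against the expression. This is a sketch for illustration; the variable names are hypothetical, and the choice of `a` as the filler character is arbitrary.

```python
import re

# The canonical expression from this specification, pasted unchanged.
DOTTED_NAME = re.compile(
    r"([a-z][a-z0-9_-]{0,63})(\.[a-z][a-z0-9_-]{0,62}){0,15}"
)

# A maximum length primary segment: 1 + 63 = 64 characters.
primary = "a" * 64
# A maximum length secondary segment: dot + 1 + 62 = 64 characters.
secondary = "." + "a" * 63
# One primary segment followed by the maximum of 15 secondary segments.
longest = primary + secondary * 15

if __name__ == "__main__":
    print(len(longest))
    print(DOTTED_NAME.fullmatch(longest) is not None)
    # Adding one more character, or one more segment, should be rejected.
    print(DOTTED_NAME.fullmatch(longest + "a") is None)
    print(DOTTED_NAME.fullmatch(longest + ".a") is None)
```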
#### Coq
Machine-checked proofs of the above propositions are provided in the
[Lanark.v](com.io7m.lanark.core/src/main/resources/com/io7m/lanark/core/Lanark.v) file.
## Rationale
_Why are names defined in terms of a regular expression rather than as a BNF
grammar_?
This specification is being written to support the development of various
[io7m](https://www.io7m.com) software packages, and validation of dotted names
is expected to occur in a wide range of different contexts such as XML schemas,
at runtime in Java code, in definitions of SQL tables, and so on. These environments
all feature regular expression validation, and not all of them support writing
parsers for more advanced grammars. With the specification itself containing
the canonical regular expression, this expression can literally be pasted into
various locations without needing any changes.
_Why are names restricted to a subset of ASCII?_
One of the uses for dotted names is in the naming of software packages published
onto the web. In systems that allow for the full use of Unicode to name
packages, it's possible for malicious parties to spoof the appearance of
packages by using carefully crafted names. For example:
`com.io7m.example`
`com.iọ7m.example`
The second package is a malicious package. It would be fairly trivial for
someone to sneak in a reference to this package as a dependency in an open
source project and have it go unnoticed. For those unable to tell the
difference: The `o` in `io7m` in the second package is actually `U+1ECD`
("Latin Small Letter O with Dot Below"). This is almost indistinguishable
from the first package, but could easily be used to fool people into thinking
they're installing a package written by someone controlling the `com.io7m`
namespace.
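A short sketch of how the restricted syntax rejects such a name follows. The lookalike character is written as the escape `\u1ecd`, which is an assumption about the exact code point used in the example above; the helper names are hypothetical.

```python
import re
import unicodedata

# The canonical expression from this specification, pasted unchanged.
DOTTED_NAME = re.compile(
    r"([a-z][a-z0-9_-]{0,63})(\.[a-z][a-z0-9_-]{0,62}){0,15}"
)

genuine = "com.io7m.example"
# Assumption: the spoofed name uses U+1ECD in place of the ASCII letter 'o'.
spoofed = "com.i\u1ecd7m.example"


def is_dotted_name(text):
    """Return True if the whole of `text` is a dotted name (hypothetical helper)."""
    return DOTTED_NAME.fullmatch(text) is not None


def non_ascii_names(text):
    """List the Unicode names of all non-ASCII characters in `text`."""
    return [unicodedata.name(c) for c in text if ord(c) > 127]


if __name__ == "__main__":
    print(is_dotted_name(genuine), non_ascii_names(genuine))
    print(is_dotted_name(spoofed), non_ascii_names(spoofed))
```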
_Won't ASCII cause problems for non-English developers?_
Currently, [Maven Central](https://search.maven.org/) is the largest collection
of open-source software on the planet. Artifacts published to Maven Central
have a _group name_ and an _artifact name_. It is conventional for _group names_
to be in reverse DNS notation, and it is not uncommon for _artifact names_
to also be in this same notation. By analyzing the largest collection
of open-source software on the planet, we can probably get some idea as to how
developers all over the world are naming their artifacts.
An [index](https://repo1.maven.org/maven2/.index/) is published weekly consisting
of a list of every single artifact published into the repository. By analyzing
the names of artifacts and groups and checking to see if those names could be
expressed using the restricted dotted name specification here, we observed
the following:
* There are `69604` unique group names on Maven Central. Of these,
`68690` have names that are expressible using the syntax defined
here. This leaves `914` inexpressible groups, for a coverage of
`98.69%`.
* There are `431423` unique artifact names on Maven Central. Of these,
`343678` have names that are expressible using the syntax defined
here. This leaves `87745` inexpressible names, for a coverage of
`79.66%`.
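The coverage figures follow directly from the counts. The arithmetic can be reproduced with the sketch below, where `coverage` is a hypothetical helper and the counts are those quoted above.

```python
def coverage(total, expressible):
    """Return (inexpressible count, coverage percentage rounded to 2 places)."""
    inexpressible = total - expressible
    percent = round(100.0 * expressible / total, 2)
    return inexpressible, percent


if __name__ == "__main__":
    # Group names: total and expressible counts as quoted above.
    print(coverage(69604, 68690))
    # Artifact names: total and expressible counts as quoted above.
    print(coverage(431423, 343678))
```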
However, we also analyzed the reasons that names failed to match the syntax
defined here and determined:
* `618` and `24810` group and artifact names, respectively, failed to
match because they contained uppercase characters. If all names are
converted to lowercase, this removes a significant chunk of "bad"
names.
* `62696` artifact names failed to match because they contained characters
other than `[a-z]` after a dot. These were frequently artifacts that,
for whatever reason, decided to encode version numbers within the name
itself. A random sample of failing names is as follows:
```
com.github.javawithmarcus.wicket-cdi-1.1
org.floggy.3rd.org.eclipse.core
com.github.1137095129
io.github.2gis
com.github.9215095360
com.9isuper.eve
org.99soft
com.moz.kiji.delegation.kiji-delegation.3.0.0.com.moz.kiji.schema
org.floggy.3rd.org.eclipse.ui
io.7mind.izumi.sbt
opentelemetry-armeria-1.0
common-util_2.13
mongoauth_3.1_2.12
content-api-client_2.12
utils-test_2.12
kafkakit_2.13
ciris-refined_2.11
dynamo-test_2.13
case-service_2.12
jimcy-java-api_2.11
```
* Only a single artifact name used a non-ASCII character on Maven
Central: `com.github.marcioos:bgg-clienẗ`
* Only `69` artifacts contained name segments that were too long to
be supported by the syntax defined here.
A random sample of failing names is as follows:
```
rapidpm-proxybuilder-modules-dynamicobjectadapter-generator-processors
rapidpm-proxybuilder-modules-objectadapter-generator-usages-usinggenerated
spring-cloud-starter-stream-processor-tasklaunchrequest-transform
stormpath-sdk-examples-spring-security-spring-boot-webmvc-bare-bones
camel-quarkus-integration-tests-support-custom-type-converter-deployment
wildfly-microprofile-reactive-streams-operators-cdi-provider-legacy-namespace
camel-quarkus-integration-tests-support-custom-type-converter-parent
stormpath-sdk-tutorials-spring-boot-default-spring-security-refined
nav-virksomhet-tiltakOgAktiviteterForBrukere-v1-meldingsdefinisjon
camel-quarkus-integration-test-support-core-main-collector-ext-deployment
```
In the author's opinion, these names are somewhat excessive and could
be supported with dotted notation instead of relentless hyphenation.
* Fewer than `20` artifacts had any other characters that do not appear in
the regular expression defined here. A random sample is as follows:
```
# These contain ':'
libaums:storageprovider
com.foilen:database-tools
libaums:http
reactivex:rxjs
app.ubie:brave-kt
# These contain '+'
bctsp-jdk15+
bcpg-jdk15+
sugar-tms_2.12at13+
mvp+android
bcprov-jdk15+
amiitool+android
bcmail-jdk15+
# These contain whitespace
com.inkapplications.spondee.math-macosx64.0.0.3.com 2.inkapplications.spondee
com.inkapplications.spondee.math-macosx64.0.0.3.com 3.inkapplications.spondee
com.inkapplications.spondee.math-macosx64.0.0.3.com 4.inkapplications.spondee
utilex
# These contain quote characters
"palsolayouts"
"android-sdk"
"rxbluetooth"
```
Many of these look like publication mistakes.
It is therefore the position of the author that if people are publishing packages
with non-English names, they appear to be doing it using the ASCII character
set.
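The failure categories described above could be assigned mechanically along the following lines. This is a hedged sketch and not the analysis actually used to produce the figures: the function `failure_reason`, its category labels, and the order in which the checks are applied are all assumptions.

```python
import re

# The canonical expression from this specification, pasted unchanged.
DOTTED_NAME = re.compile(
    r"([a-z][a-z0-9_-]{0,63})(\.[a-z][a-z0-9_-]{0,62}){0,15}"
)


def failure_reason(name):
    """Return None for a valid dotted name, else a rough category (hypothetical)."""
    if DOTTED_NAME.fullmatch(name):
        return None
    if not name.isascii():
        return "non-ascii"
    if any(c.isupper() for c in name):
        return "uppercase"
    if re.search(r"[^a-z0-9_.-]", name):
        return "unsupported character"
    segments = name.split(".")
    # Empty segments and segments starting with a digit, '_' or '-' land here.
    if any(not s or not ("a" <= s[0] <= "z") for s in segments):
        return "segment does not start with a letter"
    # The first segment may be 64 characters long, later ones 63.
    if len(segments[0]) > 64 or any(len(s) > 63 for s in segments[1:]):
        return "segment too long"
    if len(segments) > 16:
        return "too many segments"
    return "unknown"


if __name__ == "__main__":
    samples = [
        "com.io7m.lanark",
        "common-util_2.13",
        "bcprov-jdk15+",
        "nav-virksomhet-tiltakOgAktiviteterForBrukere-v1-meldingsdefinisjon",
        "rapidpm-proxybuilder-modules-dynamicobjectadapter-generator-processors",
    ]
    for sample in samples:
        print(sample, "->", failure_reason(sample))
```

A name can fail for more than one reason, so a different ordering of the checks would distribute names across the categories differently.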
_Why is the length of names bounded?_
For two reasons:
* Regular expressions can be subject to [denial of service](https://owasp.org/www-community/attacks/Regular_expression_Denial_of_Service_-_ReDoS)
attacks, particularly when they contain unbounded quantifiers such as `*`
and `+`.
* Adding an upper bound on length means more predictable storage use when
names are used in relational databases.
The regular expression as it is defined here is expected to be somewhat
less vulnerable to denial of service attacks in naive regex engine
implementations than an unbounded version would be.