Expectations 0.6: Framework for specifying expectations in terms of timings and verdicts #135
@thorehusfeldt Could you elaborate on, and give examples of, the syntax of the […]? This seems to only talk about the "expectations" and not how, or (at least not very explicitly) what, they are applied to. Specifically:
|
This is treating the time check exactly like the verdict, i.e. literally splitting out the old (v0.2?) verdicts into new (v0.3) verdict, timing pairs. E.g. […]. I don't think this is the right approach (and I would claim that this is the same approach as v0.2, since it's just a trivial split). I don't think we should be setting explicit requirements on the timing states in this fashion. Could you give some examples of where this would be either needed, or at least very useful? |
We were also talking about this regarding DOMjudge specifically. @meisterT wrote his thoughts after our discussion at: […]. Interestingly, he also comes up with this idea of new verdicts for when submissions are inside the safety margin. |
There are quite a few things going on here. We want to specify:
Let me try to go over this. 1. What do the current submissions/ directories mean?
|
| directory | permitted results | required results | used for timing |
|---|---|---|---|
| accepted | AC | AC | yes |
| wrong_answer | AC,WA | WA | no |
| time_limit_exceeded | AC,WA,TLE | TLE | yes |
| run_time_error | AC,WA,TLE,RTE | RTE | no |
- I'm ignoring the edge case of zero testcases for accepted.
- "Used for timing" means that the max running time should be outside the safety window [time_limit / ac_to_time_limit, time_limit * time_limit_to_tle); a worked numeric example follows below. In the table below, that corresponds to disallowing AC- and TLE- as required verdict in that row.
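To make the safety window concrete, here is a small sketch with made-up numbers (the margin names follow the aliases used in this thread; actual values would come from problem.yaml):

```yaml
# Hypothetical values, only to illustrate the window
# [time_limit / ac_to_time_limit, time_limit * time_limit_to_tle).
time_limit: 2.0          # seconds
ac_to_time_limit: 2      # lower boundary: 2.0 / 2   = 1.0 s
time_limit_to_tle: 1.5   # upper boundary: 2.0 * 1.5 = 3.0 s
# A submission used for timing must have its max running time either below 1.0 s
# (clearly accepted) or at/above 3.0 s (clearly too slow), never inside [1.0, 3.0).
```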
Here is the same table using Thore's AC-/TLE- notation:
| directory | permitted | required |
|---|---|---|
| accepted | AC, | AC (n.b. not AC-) |
| wrong_answer | AC, AC-, WA | WA |
| time_limit_exceeded | AC, AC-, WA, TLE-, TLE | TLE (n.b. not TLE-) |
| run_time_error | AC, AC-, WA, TLE-, TLE, RTE | RTE |
BAPCtools
- Does not yet implement the problem timing/safety margins.
- Wrongly gives RTE and TLE the same 'priority'. The spec above requires that TLE submissions are run on all cases to ensure they do not RTE, but that would 'take forever' in practice.
Problemtools
TODO
- Question: Does it verify that time_limit_exceeded submissions never give RTE?
DOMjudge
- Depends on verdict 'priorities'. Typically bails on the first error found (which can be different from the numbering due to parallelism), meaning all non-accepted directories get treated wrongly for submissions that give more than one non-AC result.
2. What do we want submissions/* to mean?
Observations:
- DOMjudge does not work well with 'mixed' results (i.e. multiple non-AC results).
- Because of this, the BAPC/NWERC jury use submissions/rejected when more than one non-AC result may occur.
Ragnar prefers being more strict: make time_limit_exceeded and run_time_error not allow other results, i.e.:
| directory | permitted results | required results | used for timing |
|---|---|---|---|
| time_limit_exceeded | AC, TLE | TLE | yes |
| run_time_error | AC, RTE | RTE | no |
Should wrong_answer/ and run_time_error/ be used for time limits? (Equivalent: Is it OK for wrong_answer and run_time_error submissions to be inside the safety margin?)
For simplicity Ragnar prefers the current status of not using them for time limits.
Let's say we would use wrong_answer/* and run_time_error/* for time limit determination; would we then take the maximum over passing testcases, or over all testcases? I would say we should include all cases. (Note that I think this would require additional WA- and RTE- verdicts to specify?)
3. What new default directories do we want?
Thore already suggested does_not_terminate and not_accepted (what we in NWERC call rejected so far). We also use mixed, which is for submissions that may or may not time out or crash (but do not WA).
| directory | permitted results | required results | used for timing |
|---|---|---|---|
| does_not_terminate | AC,TLE,RTE | TLE,RTE | no |
| not_accepted (rejected?) | AC,WA,TLE,RTE | WA,TLE,RTE | no |
| mixed | AC,TLE,RTE | AC,TLE,RTE | no |
4. YAML Syntax
(Q1): Do we allow regex and/or glob in the submission path and testcase path? The below assumes glob only.
used_for_timing
My proposal is to add a used_for_timing: bool field. Not sure whether true or false should be the default. I used false as the default for now.
- true: the max running time must not be in [time_limit / ac_to_time_limit, time_limit * time_limit_to_tle). This can either be verified for a given time limit, or used as additional constraints (inequalities) the to-be-determined time limit must satisfy (see the sketch below).
- false: no additional restriction on the time limit.
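As a sketch of how used_for_timing: true turns into inequalities on a to-be-determined time limit (hypothetical running times and margins, not from the proposal, and ignoring which boundary is strict):

```yaml
# Hypothetical margins:
ac_to_time_limit: 2
time_limit_to_tle: 1.5
# Measured max running times of two timing submissions:
#   accepted/fast.py            -> 0.8 s  (must stay below time_limit / 2)
#   time_limit_exceeded/slow.py -> 4.5 s  (must lie at or above time_limit * 1.5)
# Resulting constraints on the time limit:
#   time_limit >= 2 * 0.8 = 1.6   and   time_limit <= 4.5 / 1.5 = 3.0
# so e.g. time_limit: 2.0 satisfies both, while 1.5 or 3.5 would not.
```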
(Q2): What do you think of used_for_timing: bool? Is it clear and intuitive enough? Is it expressive enough? E.g. it does not allow setting a lower multiplier for some unreasonably slow (but still should-be-accepted) submissions.
Example YAML. This is only to show syntax; specific values can be discussed. The implied default submissions/expectations.yaml is something like:
# Or just `accepted`? or `accepted/`?
accepted/*:
  permitted: ["AC"]
  required: []
  used_for_timing: true
wrong_answer/*:
  permitted: ["AC", "WA"]
  required: ["WA"]
time_limit_exceeded/*:
  permitted: ["AC", "TLE"]
  required: ["TLE"]
  used_for_timing: true
run_time_error/*:
  permitted: ["AC", "RTE"]
  required: ["RTE"]
does_not_terminate/*:
  permitted: ["AC", "TLE", "RTE"]
  required: ["TLE", "RTE"]
rejected/*:
  permitted: ["AC", "WA", "TLE", "RTE"] # NOTE: This could be omitted.
  required: ["WA", "TLE", "RTE"]
mixed/*:
  permitted: ["AC", "TLE", "RTE"]
  required: ["AC", "TLE", "RTE"]
or, using abbreviations, the below would be equivalent:
accepted/*: accepted
wrong_answer/*: wrong answer
time_limit_exceeded/*: time limit exceeded
run_time_error/*: runtime excepted # TODO: Naming of this
does_not_terminate/*: does not terminate
rejected/*: rejected
mixed/*: mixed
Further, it is possible to specify verdicts for testgroups, and to make assertions on the score of a submission and on a judgemessage that must appear for at least one testcase:
# AC that is not used for timing purposes and may be close to the timelimit.
sluggish_ac/*:
  permitted: ["AC"]
  required: []
  # used_for_timing: false # (default value)
wrong_answer/caught_by_samples.py:
  sample:
    required: ["WA", "TLE", "RTE"]
# Or equivalently:
wrong_answer/caught_by_samples.py:
  sample: rejected
# Or more precisely:
wrong_answer/wa_on_samples.py:
  sample: wrong answer
# Note that this also still checks that the submission is WA (and not RTE/TLE) on all other cases because of its directory.
# Maybe we want to also have 'hidden' edge cases only exposed in secret data:
wrong_answer/wa_on_secret.py:
  sample: accepted
# TODO: Exclude a single submission from time limit determination:
# (How do we do overriding/inheriting when multiple globs/regexes match a submission?)
accepted/kinda_slow.kt:
  permitted: ["AC"]
  required: []
  used_for_timing: false
YAML integration for @niemela, even though this is not the current focus. (We're still at what to specify, not how to specify it.) The simplest file would be something like:

accepted: accepted
wrong_answer: wrong answer
time_limit_exceeded: time limit exceeded
runtime_exception: run time exception

One could say that this is the implied default. In particular, a problem development system can completely ignore all YAML, as long as it treats the submissions in the default directories as specified. Aside from pinning down the spec, the definitions above also allow bespoke directories, such as the following for "all the sluggish java submissions":

other_ac:
  permitted: ["AC"]

or more fine-grained specifications, such as Thore promising that his wrong_answer/thore.cpp passes the samples:

wrong_answer/thore.cpp:
  sample:
    permitted_verdicts: ["AC"]

You can go arbitrarily crazy with this and specify the behaviour of individual submissions on individual testcases (or groups):
The matching rules (do we prefix-match, glob, regex, wildcard?) are currently in flux and downstream from us finally agreeing on how to specify |
@RagnarGrootKoerkamp : thanks for the table. I think the second table should not have |
@RagnarGrootKoerkamp: I played around with […] as a single concept. If you're kinda confident that no terrible things will happen, I can live with the single boolean field. |
Yes. But we should only do it if it has a significant benefit.
What is that edge case? Zero testcases?
Another reason to use this is that you just don't care how it fails, only that it fails. I think it makes sense to have such a directory.
I think I agree. Especially if we also have
If I understand the canonical use of […]. The CLICS name for […]. I don't love the name […].
Glob only sounds good to me. Is there an exact definition/specification that we could use? I couldn't find one immediately.
I think that I'm not really sure if it makes sense to have
I'm not convinced that the abbreviations are needed.
Up until we introduced […]:
accepted/*:
  permitted: ["WA"]
Because […]. With […], this means that the example a bit above could be written as:
accepted/kinda_slow.kt:
  used_for_timing: false
The simplest file would be the empty file, which would then not have to exist. I would suggest that the semantics should be that the default directory behavior is assumed to be added to the top of the file. I.e. the |
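A minimal sketch of the "defaults prepended to the file" semantics proposed just above (hypothetical file contents; the default block is the one shown earlier in this thread):

```yaml
# If submissions/expectations.yaml contains only this...
accepted/kinda_slow.kt:
  used_for_timing: false
# ...it would be read as if the implied defaults (accepted/*, wrong_answer/*, ...)
# were written above it, so kinda_slow.kt must still be AC on every testcase;
# the only change is that it is excluded from time limit determination.
```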
@thorehusfeldt I don't understand what you're saying here, so I can't be confident that it won't happen. Could you explain more? |
As I understand it, the edge case is where there are no testcases within a test group. |
Ah, I realize now that I read "submission" and not "testcases". This makes much more sense. Thanks :). That said... is that an edge case? Every "permitted" rule will always succeed on an empty set, (and every non-empty required rule will always fail). So any accepted submission will always succeed on the empty set of testcases. That feels right, right? (Useless of course, but right). |
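To illustrate the reasoning above with a concrete (hypothetical) group that happens to contain zero testcases:

```yaml
# Suppose secret/empty_group matches no testcases at all.
accepted/sol.py:
  secret/empty_group:
    permitted: ["AC"]   # vacuously satisfied: no result exists that could violate it
    required: ["AC"]    # cannot be satisfied: no result exists to supply a required verdict
```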
Great progress. Let me try to pin down […]. First, let me change the name to […]. In particular, there could be traditions where the […]. With that out of the way: (recall that an expectation is something a submission has on a set of
This is the best I can do; pay particular attention to the […]. The alternative is to separate the concepts "all |
These situations happen all the time, for instance when you run the |
I don't think the definition should be over whether |
I think this means that you agree that it is correct? |
You correctly observe that a symmetric concept like "we want to use this for time-limit-y stuff" is ill suited to express two intentions ("we want […]"). However, I try to be constructive and help by at least attempting to write down what |
For
That's also my weak preference.
Yes, this is also my feeling.
I agree this is nice, but sadly YAML dictionaries are explicitly unordered :/ We'll have to think about this more.
I think that works for me. Give me a day to get used to it :)
Ok, so the way I initially think about […]. Note:
Examples:
Let's think about how this interacts with:
1.
|
Agreed. (But also, leaving out
Right, you defined it as:
mixed/*:
  permitted: ["AC", "TLE", "RTE"]
  required: ["AC", "TLE", "RTE"]
This seems to me to be the same use case (but ever so slightly different definition) as what @thorehusfeldt calls
Right, I knew that... :(. That's very annoying though.
FTR, I think this is the right approach for
Do we actually need to allow |
They are quite similar indeed. The only distinction is that […]. But I agree that
yes, I was thinking of something similar.
If you have 3 testgroups, A, B, and C, I can imagine some submission that should pass A (with margin), fail B (with margin), and fail C (also with margin). The running time could explode on C, but you would still want it to fail with margin on B as well. I'm not sure if this 100% makes sense with scoring problems though. There you may want to ensure that the score for group B is 0, and that it times out for each testcase in group B? One way we could do that is to distinguish […]. In the […]:
submission.py:
  secret/group_b:
    permitted: [TLE] # Note: not TLE-
whereas now this becomes:
submission.py:
  secret/group_b/*:
    permitted: [TLE]
    required: [TLE]
    with_margins: true
|
Here is a case where […]:
mixed/brute-force.py:
  sample: accepted
  secret/group1: accepted
  secret/group2: time limit exceeded
Here it is crucial that […]:
required_verdicts: ["TLE"]
required_timings: ["too slow with margin"]
|
How about this?

# file glob specifying some set of files. (1)
<file glob>:
  # the calculated verdict for the submission (2)
  verdict: [<verdicts>]
  # score of the submission (3)
  score: <number or range>
  # permitted and required testcase verdicts (4)
  permitted: [<verdicts>]
  required: [<verdicts>]
  # these submissions are used for timing (5) (6)
  used_for_timing: <bool>
  # testcase glob specifying some set of testcases or groups. (7)
  <testcase glob>:
    # score on the testcase or group (8)
    score: <number or range>
    # permitted and required testcase verdicts (9)
    permitted: [<verdicts>]
    required: [<verdicts>]
    # the result on this testcase or group of this submission is used for timing (10)
    used_for_timing: <bool>
  # multiple sets of testcases or groups can be specified
  <testcase glob>:
    permitted: [<verdicts>]
    required: [<verdicts>]
    score: <number or range>
    used_for_timing: <bool>

Verdicts are "AC", "WA", "RTE", "TLE".

(1): Since a program is "always represented by a single file or directory" we can't (or shouldn't?) allow submissions that are not exactly one directory level down. I.e. […]

(2): Set of verdicts. The verdict of the submission must be one of them. Note that this does not directly say anything about verdicts of testcases, and allows judging to be short-circuited if that's how the problem is specified.

(3): Score for the submission. If a number, it must be the exact score given; if a range, it must be within the range (i.e. >= the lower end and <= the upper end). Note that a non-0 score implies that the verdict is "AC". How should matching on score behave when the verdict is not "AC"? Should the score always be 0 in that case, should it be allowed to be 0, or should it be "null"?

(4): All verdicts of testcases must be in the permitted set, some verdicts of testcases must be in the required set. Default for permitted is […]

(5): That a submission is used for timing means that it must be possible to set a time limit such that the time for the submission (i.e. the maximum of the times for testcases) is not within any of the margins from the time limit. Note that if the time limit is explicitly set, then that is the only time limit that can be set. Defaults to false. I think this is the same as what @RagnarGrootKoerkamp said?

(6): Alternatively we could have some kind of […]

(7): If these are just any glob then […]

(8): Score for the testcase/group(s) specified by the glob.

(9): Similar to (4), but applies separately to every testcase/group specified by the glob.

(10): Similar to (5) and (6) but applied to a single testcase/group. Note that (5), (6) vs (10) implies that the following are different:

accepted/sol1.cc:
  used_for_timing: true

accepted/sol1.cc:
  */*:
    used_for_timing: true
|
@niemela nice writeup!
(1): I'm OK with allowing none and arbitrary nesting.
(2): I think Thore explicitly steered away from assertions on the final verdict, and given […]
(3): I'd say non-AC should simply have […]
(4): Good point re. […]
(5): yes sgtm
(6): My feeling is that for now it's better to not go with explicit overridable margins yet, since one can always adjust the problem-wide margin if things are wrong. One thing that we could do using more granular margins is e.g. setting a […]
(7): Yes, it should start with […]
(8,9,10): Hmmmm, the grouping/globbing is quite tricky: […] I think this creates some extra complexity. We should think about whether we really need this. If we do want it, it's probably better to store this alongside the existing fields rather than overloading the glob field.
(11): Looking at Thore's last post above, I do think abbreviations are quite nice, especially when used with testgroups. Maybe we need an "[…]" |
You can; my proposal is supposed to exactly make this possible. For a realistic example, consider:
thore.*: wrong answer
*.py:
  sample: accepted
The submissions |
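My reading (an assumption, not spelled out in the thread) of how the two rules above combine for particular files, given that a submission must satisfy every expectation whose glob matches it:

```yaml
# Which expectations apply to which submission (hypothetical filenames):
#   thore.py  -> matches thore.* and *.py: must be wrong answer overall,
#                and additionally accepted on the sample group
#   thore.cpp -> matches only thore.*: must be wrong answer
#   alice.py  -> matches only *.py: must be accepted on the sample group
```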
Only tricky for scoring problems, and then it gets really tricky indeed. But only some of us care about scoring problems, so I’ve decided to put very little emphasis on that. |
Very cool. Does this really work (if so: great)? Let me try to rephrase. Again,
Is that what we mean? |
thore.*: wrong answer
*.py:
  sample: accepted
What if we want to apply both of these rules only to thore.py? Then
thore.py: wrong answer
thore.py*:
  sample: accepted
may be needed. (But I'm happy to accept this as a niche case that's not very pretty.)
Yes :) |
We (@Tagl @RagnarGrootKoerkamp @thorehusfeldt and myself) had a fairly long and somewhat productive (IMO) call to discuss the details. Here are some things I noted from that meeting. Not in order of importance, only numbered for reference:
- We agree that submissions must always be one directory level down, since otherwise we can't really tell a directory full of submissions apart from a directory that is a submission. I.e. these are all acceptable submissions: […]
- The rules for combining […]
- The rules for combining settings of […]. Examples:
accepted:
  used_for_timing: true
accepted/sol1.cc:
  used_for_timing: false
# this is ok, everything in accepted except accepted/sol1.cc is used for timing.
---
*/*.py:
  used_for_timing: false
accepted/really_fast.py:
  used_for_timing: true
# not ok, accepted/really_fast.py is given contradicting values.
- We agreed that setting a tuple of […]
- We disagreed quite a bit on what the "testcase globs" could refer to: just testcases, or also test groups. I.e. does […]
- We disagreed quite a bit on what the semantics of rules on "testcase globs" would be. Applying the rules individually to all test cases (or groups?) would be consistent with how the filename globs work, but applying the rules once to the set of all test cases matched would be more powerful.
- Taking 7 & 8 together, maybe we should have separate ways of targeting testcases and test groups? (I think that's more complication than it's worth.) |
I talked some more with @Tagl; he came up with this idea for making the testcase matching just different enough that we could use the more powerful semantics (discussed under 8 above). Instead of (something like) this:

<file glob>:
  verdict: [<verdicts>]
  score: <number or range>
  permitted: [<verdicts>]
  required: [<verdicts>]
  used_for_timing: <bool>
  <testcase glob>:
    score: <number or range>
    permitted: [<verdicts>]
    required: [<verdicts>]
    used_for_timing: <bool>
  <testcase glob>:
    permitted: [<verdicts>]
    required: [<verdicts>]
    score: <number or range>
    used_for_timing: <bool>

we could do:

<file glob>:
  verdict: [<verdicts>]
  score: <number or range>
  permitted: [<verdicts>]
  required: [<verdicts>]
  used_for_timing: <bool>
  testcases:
    - matches: <testcase glob>
      score: <number or range>
      permitted: [<verdicts>]
      required: [<verdicts>]
      used_for_timing: <bool>
    - matches: <testcase glob>
      permitted: [<verdicts>]
      required: [<verdicts>]
      score: <number or range>
      used_for_timing: <bool>

Thoughts? |
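A concrete instance of the second shape, with made-up values, just to see how it might read in practice (the field names follow the sketch above; the filename, globs, and values are hypothetical and nothing here is settled):

```yaml
accepted/th.py:
  verdict: ["AC"]
  used_for_timing: true
  testcases:
    - matches: "sample/*"
      permitted: ["AC"]      # must be AC on every matched sample testcase
    - matches: "secret/group2/*"
      required: ["AC"]       # at least one AC among the matched testcases
      score: 30              # exact score for the matched cases (range syntax TBD)
```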
A lot of progress was made! |
Yes, it should absolutely be written (in the YAML file) as
No, I don't think so?
Yes, thanks for catching that. I have fixed the typo in the comment above.
Well, this was apparently a point of a lot of divergence. I would argue that "verdict aggregation to submission level" was always part of the format, and furthermore that it absolutely, positively must be. If we don't even agree on what the correct judgement is, it's not much of a standard, is it? OTOH, there is a weaker form of agreement we could aim for, where we must agree on accepted vs rejected, but are allowed to possibly reject differently (because that matters a lot less).
I think so, yes. If I understand the counterpoint correctly, the testcase rules are supposed to match on sets of testcases, and never groups. With that view, then
Yes. 👍 |
The most dramatic change that came out of the meeting yesterday – nullifying three months of exchanges in various channels, the consensus of the August meeting in Lund, and all my concrete, thorough, and complete proposals since August – is that we discovered that there is no consensus about the following claim:
Instead, expectations are now something much richer and also express constraints on aggregated verdicts. (These are themselves defined in terms of aggregation rules for sequences of testcase-verdicts.) Thus, a […]. However, these changes were suggested 38 minutes before the meeting yesterday, and I haven't had as much time (compared to the three months we've discussed proposals 0.1, 0.2, and 0.3) to think that through. I will try to pin down what I think we mean by this and re-emerge with expectations 0.4. |
How can that be hard to find out? I quote myself from expectations 0.3 at the top of this page:
#expectation: {
  permitted_verdicts?: [...#verdict] // only these verdicts may appear
  required_verdicts?: [...#verdict]  // at least one of these verdicts must appear
  ....
}
BTW, these fields will now probably be renamed to |
How does this proposal relate to #112? |
It is the third or fourth iteration. |
Does this mean the other issue is obsolete and can be closed? Or does this add on top of the other version? |
This claim confuses me. Aggregated verdicts are themselves "constraints on sets of testcase-verdicts", so constraints on testcase-verdicts and aggregated verdicts are still constraints on sets of testcase-verdicts? Or is the key insight here the difference between sets of testcase-verdicts versus sequences of testcase-verdicts? |
Yes, the section you quote is what I based my understanding on. The source of the confusion is the suggestion that there is some set-theoretic combination of […]. You say:
How does multiple |
Yes, I have closed it.
In the sense that all our discussions build on top of all our earlier discussions and insights, it kinda does "add on top". That said, closing a ticket does not delete it, we can still read and refer to it. (And even reopen if we feel the need). |
They are not; the important word here is "set". Aggregated verdicts are constraints on totally ordered sequences of testcase-verdicts. (In particular, they are defined in terms of which non-[…].) And aggregate scores are something even richer: they are constraints on trees of verdicts, modified by the aggregation rules of […]. Sets are not totally ordered sequences, and totally ordered sequences are not partial orders. Note for instance the very first paragraph in my own introduction from yesterday:
This was literally my opening sentence yesterday, and I highlighted the choice of the word “collection” (to encompass all three cases) verbally. You may be underestimating my ambition for precision. :-) |
I would maybe say that aggregated verdicts are constraints on sets of (testcase, verdict) pairs. Testcases themselves are, of course, ordered.
So you did say that you wanted to "specify the expected behaviour on [...] lists [and] trees", but now that we do that, it is "nullifying three months of exchanges..."? I'm actually very confused. |
A submission must satisfy all the expectations that apply to it. (For instance, all the expectations that match – by whatever rule we can agree on – its filename. Or are specified in the source code. Or hardcoded into the specification. Or… it’s not important.) So (by whatever rule) assume submission
|
No, I believe that's incorrect. Let's say expectation E is […]. What am I missing? |
You're right, I'm wrong. Thank you. My first semantics is the intention: A submission must satisfy all expectations that apply to it. |
Closing this. Discussion continues in #137 . |
_This is version 0.6 of the expectations proposal._
Goals
Primary Design Goals
- Precisely specify the meaning of "verdict" and "expectation"
- Define the semantics for setting the time limit within "safety" margins from judge submissions
- Provide clear semantics for expectations of author submissions placed in subdirectories of /submissions/*
Secondary Design Goals
- Allow bespoke expectation profiles beyond "accepted", "wrong answer", etc.
- Expectations for judge messages ("on some testcase, the submission produces guess out of bounds") so that we can specify full test coverage for custom output validators
- Expectations for scores ("this submission gets at most 65 points on all testcases")
- Allow expectations to be specified for test groups like sample or secret/group1 or even secret/03-petersen_graph
- Provide a schema for expectations specified in a separate expectations.yaml file
Constraints
Out of scope:
Givens
- Relevant metadata is given in problem.yaml, specified as follows: […]
- For readability, the present document introduces aliases to the metadata specified in #problem: […]
Proposal: Verdicts and timings
Result
For precision, we define the result of running a submission on a testcase for a single-pass problem.
Decisions made here; this is RfC2.
We use here the term “result” (not “outcome” or “verdict”).
- time! is not optional. Every result has a time, even those that are aborted or crashed. We promise that time is never compared to anything above tle_margin * time_limit, so any value above that is the same.
- Careful distinction between successful termination (of the process) and successful output validation; note the required field output_validated!: every submission that terminated_successfully has its output validated. Note that this means "the output validator accepted" (it returned 42).
- Scores are nonnegative.
- The judge message is called message. Alternative: judge_message.
- Nothing else is part of the result. (Memory use, team message, programming language.) Maybe it should.
- Is this good enough for multipass problems as well, including interactive problems? (Honest question.) What needs to be understood is which execution times are cumulative, for instance. This is out of scope for me, but worth thinking about.
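To make the list above concrete, an informal sketch of one result record (the actual definition is the elided CUE struct; field names beyond those listed above are assumptions):

```yaml
# One result of running a submission on one testcase (informal sketch).
time: 1.73                       # seconds; always present, even for aborted or crashed runs
terminated_successfully: true
output_validated: true           # "the output validator accepted" (it returned 42)
score: 1.0                       # nonnegative
message: "guess out of bounds"   # the judge message
```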
Verdict
The #verdict is a shorthand of the #result. There are four verdicts: AC, WA, TLE, and RTE.
However, the behaviour of contest systems is beyond the scope of this specification. The present write-up focusses on verdicts used for specifying expectations.
CE, MLE, or JE are not verdicts. (Making them verdicts requires a richer #result. No author submissions are ever expected to produce CE or JE, and MLE is very hard to specify.)
What verdicts mean
We can specify the semantics of each #verdict quite readably in CUE: […]
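The CUE block itself is elided in this extract; as a rough, informal restatement (in YAML rather than the original CUE, and glossing over the exact inequalities), the intent is roughly:

```yaml
# Informal sketch of what each verdict means in terms of the #result fields above.
AC:  "terminated successfully within the time limit, and the output validator accepted"
WA:  "terminated successfully within the time limit, but the output validator rejected"
RTE: "did not terminate successfully, within the time limit"
TLE: "the running time exceeds the time limit"
```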
Timing
Like #verdict, the #timing is a discrete summary of the result:
This discretises the value of result.time in terms of the three boundaries provided by problem.limit in the obvious fashion; we can specify this precisely in CUE to be clear about which inequalities are strict: […]
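For example, with hypothetical values (the exact category names and strictness of the boundaries are defined in the elided CUE, not here):

```yaml
limits:
  time_limit: 2.0        # seconds
  ac_to_time_limit: 2    # lower boundary: 2.0 / 2   = 1.0 s
  time_limit_to_tle: 1.5 # upper boundary: 2.0 * 1.5 = 3.0 s
examples:
  - { time: 0.7, note: "accepted, outside the safety window (below 1.0 s)" }
  - { time: 1.4, note: "accepted, but inside the safety window [1.0, 2.0)" }
  - { time: 2.5, note: "too slow, but inside the safety window [2.0, 3.0)" }
  - { time: 3.5, note: "too slow with margin (at or above 3.0 s)" }
```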
Proposal: Expectations
The role of expectations is to:
- during problem development, ensure expected behaviour of author submissions on testdata as the problem definition, testdata, and author submissions are changed or added
- define which author submissions are used to set the time limit
Expectations defined
An expectation (for a submission) is defined for a set of results, primarily in terms of verdicts, timings, messages, and score ranges:
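The struct definition itself is elided in this extract; informally, a fully spelled-out expectation might look like the following sketch (hypothetical submission and values; the verdict field names follow the 0.3 quote later in this thread, the timing fields are analogous, and the range field name is assumed from expectation.range below):

```yaml
wrong_answer/guesser.py:
  permitted_verdicts: ["AC", "WA"]   # every result's verdict must be among these
  required_verdicts: ["WA"]          # at least one result must have this verdict
  message: "guess out of bounds"     # must appear (as a substring) in some result.message
  range: [0, 0]                      # every result.score must lie in this range
```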
A set of results satisfies an expectation if
- for each result, the result's verdict/timing is contained in permitted_verdicts/timings
- at least one of the verdicts/timings in required_verdicts/timings appears among the verdicts of the set of results
- the message appears (as a substring) in at least one result.message among the verdicts of the set of results
- every result.score is inside expectation.range
Common abbreviations
Typically a problem author will not use the fine-grained #expectations struct defined above, but instead use a common abbreviation: […]
The above definitions are meant to define the expectations for submissions in /submissions/accepted etc. Here are two additional, useful abbreviations: […]
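The abbreviation definitions themselves are elided in this extract. For reference, a rough sketch of what the basic abbreviations (not the two additional ones) presumably expand to, based on the default directory semantics discussed earlier in this thread:

```yaml
# Informal sketch, not the actual definitions from the proposal.
accepted:
  permitted_verdicts: ["AC"]
  required_verdicts: ["AC"]
wrong answer:
  permitted_verdicts: ["AC", "WA"]
  required_verdicts: ["WA"]
```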