-
Notifications
You must be signed in to change notification settings - Fork 3
feat: Allow nullable primary keys #14
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## main #14 +/- ##
=========================================
Coverage 100.00% 100.00%
=========================================
Files 36 37 +1
Lines 1788 1807 +19
=========================================
+ Hits 1788 1807 +19 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the immediate contribution following your issue @gab23r, great to see that you're keen on making dataframely work for your use case 😁
I thought about the fact that we're not allowing NULLs in primary keys again and I wanted to ask: are you actually interested in allowing NULL value for primary keys or do you use primary_key=True
as a "workaround" for unique=True
?
In my mind, this is slightly different. Polars has no concept of a "primary key" so it makes sense that it doesn't treat NULL values specially e.g. when using n_unique()
or group_by()
. However, a NULL value generally expresses "I don't know the value". Hence, it does feel odd to allow NULL values in a primary key as an "unknown value" is typically not a good identifier for a row (after all, NULL == NULL
also evaluates to NULL
) 🤔 happy to hear your thoughts @AndreasAlbertQC @delsner
I think that is one half of it. The other half is that NULL means "no match found" when joining. If I think of a lookup table, a row with primary key being NULL would be the equivalent of just not having a row in the lookup table and then having to
I think the big implication is that performing a join on this primary key will not return rows, which makes my paragraph above obsolete. In practice, particularly the case of having a |
I agree that having nullable primary key feels odd, but in my case I need to validate data from a supplier and I do not really control the content of the table. Unique=True wouldn't make the job. As my table has multiple primary keys ( the column can have multiple nulls ) |
I think one should be able to implement this pretty straightforwardly by using a custom rule that enforces the composite uniqueness across multiple keys. I feel like this solution might be preferable because it avoids some of the murkiness discussed above: class MySchema(dy.Schema):
x = dy.Integer(nullable=True) # part of composite unique key
y = dy.Integer(nullable=True) # part of composite unique key
z = dy.Integer(nullable=False)
@dy.rule(group_by=["x","y"])
def composite_unique():
return pl.len() == 1 While testing this, though, I just noticed that in the special case of null x/y values, this does not appear to work as expected, which I think is a bug: # Should not pass validation
df = pl.DataFrame(
{
"x":[None,None],
"y":[None,None],
"z":[5,6],
}
)
# Passes validation!
MySchema.validate(df,cast=True) Tracking this separately as #15 |
That would work indeed. I have no strong feelings about it even if I don't understand why we couldn't allow the user to have nullable primary keys if he explicitly asks for it. Are there technical things blocking for that ? |
@gab23r we have released #16 in v1.2.1 which allows for the custom workaround proposed by @AndreasAlbertQC. I think the simple fact that joins do not, by default, consider That being said, I think we should still fail hard if the user provides both |
I understand. Raising an error would be better indeed. I can do a PR. |
fixes #13
Motivation
Allow the user to define a nullable primary key column.
Changes
nullable
will default tonot primary_key
is not provided by the user, otherwise we keep user choice.