The title is a riff on (Finn, n.d.).
In the real world there are many reasons why a data point would be absent from a dataset:
- Not collected
- Not collected for reasons
- Not observed
- Thrown out for reasons
- Anonymized for reasons
Suppose I have some values associated with time, like
Year | Data A |
---|---|
2000 | 123 |
2003 | 456 |
2004 | 789 |
I want to combine this with yearly data from another data set to do some analysis:
Year | Data A | Data B |
---|---|---|
2000 | 123 | ab |
2001 | | cd |
2002 | | ef |
2003 | 456 | gh |
2004 | 789 | ij |
Is Data A missing years 2001 and 2002? Is it missing 1995? How about 1980 or 2008?
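To make the question concrete, here is a minimal sketch in plain Python. The dictionaries mirror the tables above, `combine` is a hypothetical helper, and using `None` for "no entry" is just one possible encoding, not a recommendation.

```python
# Hypothetical encoding of the two data sets; None will mark "no entry for this year".
data_a = {2000: 123, 2003: 456, 2004: 789}
data_b = {2000: "ab", 2001: "cd", 2002: "ef", 2003: "gh", 2004: "ij"}

def combine(a, b, years):
    """Outer-join two year-keyed dicts over an explicit range of years."""
    return [(year, a.get(year), b.get(year)) for year in years]

# Joining over the years that appear in either data set:
observed_years = sorted(set(data_a) | set(data_b))
rows = combine(data_a, data_b, observed_years)
missing_a = [year for year, a, _ in rows if a is None]
print(missing_a)  # [2001, 2002]

# Joining over a wider range changes what counts as "missing":
wider = combine(data_a, data_b, range(1995, 2009))
missing_a_wider = sum(1 for _, a, _ in wider if a is None)
print(missing_a_wider)  # 11
```

The data did not change between the two joins; only the range of years I asked about did. What counts as missing depends on the question as much as on the data.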
How is this missing-ness encoded in the data methodology, data sets, software, and programming languages?
Languages and their Nothings
Language | Syntax | Implementation | Meaning |
---|---|---|---|
IEEE 754[^1] | NaN | Value(s) of a floating-point number | Not a Number |
Python | None | Object, Singleton of NoneType | Absence of a value[^2] |
Python | nan | Float value | IEEE NaN |
Python - Pandas | <NA> | Nullable Integer | Proxy for IEEE NaN[^3] |
Julia | NaN | Float value | IEEE NaN |
Julia | missing | Value, Singleton of Missing | Missing value in statistical sense[^4] |
R | NA | Value, Instances for multiple types | Missing value in statistical sense[^5] |
SQL | NULL | Marker for absent value | Absence of a value, Missing or Inapplicable information |
C/C++ | NULL | Preprocessor macro (implementation-defined) | Pointer that does not point to a valid object |
C/C++ | nullptr | Singleton of nullptr_t | Pointer that does not point to a valid object |
Haskell | Nothing | Value of Maybe a | Optional value, used for errors or exceptional cases.[^6] |
Rust | None | Value of Option<T> | Optional value, used for default values, errors, nullable pointers[^7] |
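The two Python rows behave quite differently in practice. A small sketch, using only the standard library, of how `None` and `nan` respond to comparison and arithmetic:

```python
import math

nan = float("nan")

# None is a singleton object of NoneType: identity and equality both hold.
print(type(None).__name__)  # NoneType
print(None is None)         # True

# NaN is an ordinary float value that compares unequal to everything, itself included.
print(type(nan).__name__)   # float
print(nan == nan)           # False
print(math.isnan(nan))      # True

# None refuses to take part in arithmetic; NaN propagates through it.
try:
    None + 1
    none_arithmetic = "ok"
except TypeError:
    none_arithmetic = "TypeError"
print(none_arithmetic)       # TypeError
print(math.isnan(nan + 1))   # True
```

So `None` fails loudly when treated as a number, while `nan` flows silently through a calculation and has to be checked for explicitly.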
There’s a saying that programming is just manipulating data. But does that really apply to the statistical and experimental interpretation of “data”?
Bonus: default function arguments, for when the caller does not supply anything.
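In Python this gets awkward when `None` is itself a legitimate argument. A common idiom, sketched here with a hypothetical `_UNSET` sentinel and a hypothetical `describe` function, is to use a private object as the default so that "nothing supplied" and "None supplied" stay distinguishable:

```python
# _UNSET is a hypothetical module-private sentinel object.
# Callers have no reason to pass it, so seeing it means "no argument given".
_UNSET = object()

def describe(value=_UNSET):
    """Tell apart 'no argument given' from 'None passed explicitly'."""
    if value is _UNSET:
        return "caller supplied nothing"
    if value is None:
        return "caller supplied None"
    return f"caller supplied {value!r}"

print(describe())      # caller supplied nothing
print(describe(None))  # caller supplied None
print(describe(0))     # caller supplied 0
```

That is yet another kind of nothing: one that exists only to say the other nothing was not used.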
IEEE 754
Section 6.2
Quiet NaNs should, by means left to the implementer’s discretion, afford retrospective diagnostic information inherited from invalid or unavailable data and results. To facilitate propagation of diagnostic information contained in NaNs, as much of that information as possible should be preserved in NaN results of operations.
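The "diagnostic information" lives in the spare bits of the NaN encoding, often called the payload. As a rough sketch, assuming the binary64 layout and a runtime that copies the bits untouched (typical for CPython on common hardware, but not guaranteed everywhere), a payload can be written into a quiet NaN and read back. `make_nan` and `nan_payload` are hypothetical helper names.

```python
import math
import struct

QUIET_NAN_BITS = 0x7FF8_0000_0000_0000  # exponent all ones, quiet bit set
PAYLOAD_MASK = 0x0007_FFFF_FFFF_FFFF    # the 51 fraction bits below the quiet bit

def make_nan(payload):
    """Build a binary64 quiet NaN carrying `payload` in its low fraction bits."""
    bits = QUIET_NAN_BITS | (payload & PAYLOAD_MASK)
    return struct.unpack("<d", struct.pack("<Q", bits))[0]

def nan_payload(x):
    """Read the payload bits back out of a float."""
    bits = struct.unpack("<Q", struct.pack("<d", x))[0]
    return bits & PAYLOAD_MASK

tagged = make_nan(42)
print(math.isnan(tagged))   # True
print(nan_payload(tagged))  # the payload stored above, if the bits survive the round trip

# Whether an operation preserves the payload is up to the hardware and runtime,
# so this line is printed without an expected value.
print(nan_payload(tagged + 1.0))
```

Python offers no built-in way to attach or inspect a payload, so in practice a NaN there usually carries no information beyond "something went wrong somewhere".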
“IEEE 754 Error Handling and Programming Languages”
The definition “max(1.0, NaN) = 1.0” is correct when a NaN is a missing value and what is wanted is the maximum non-missing value of a vector (as in one expression mode in many statistical packages) but is mathematically incorrect when it is an error state (as generally in IEEE 754).
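The two readings in that quote can be written as two different functions. In the sketch below, `max_skip_missing` and `max_propagate` are hypothetical names of mine. Python's built-in `max` commits to neither reading: it relies on `>` comparisons, which are always false for NaN, so its answer depends on argument order.

```python
import math

nan = float("nan")

# The built-in keeps the first argument unless a later one compares greater.
print(max(1.0, nan))  # 1.0
print(max(nan, 1.0))  # nan

def max_skip_missing(values):
    """NaN as a missing value: maximum of the non-missing entries."""
    present = [v for v in values if not math.isnan(v)]
    return max(present) if present else math.nan

def max_propagate(values):
    """NaN as an error state: any NaN poisons the result."""
    values = list(values)
    return math.nan if any(math.isnan(v) for v in values) else max(values)

print(max_skip_missing([1.0, nan]))  # 1.0
print(max_propagate([1.0, nan]))     # nan
```

Neither function is more correct than the other; they answer different questions about what the NaN was standing in for.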
Appendix A considers some potential interpretations of NaN:
- A. A missing value (i.e. unknown but valid)
- B. Not numeric at all (e.g. ‘purple’)
- C. Inapplicable (i.e. not a datum)
- D. Numerically indefinite (e.g. ≈ 0/ ≈ 0)
- E. The result of an invalid operation
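It is worth seeing how one language sorts a few of these cases. The sketch below uses a hypothetical `outcome` helper that runs an expression and reports either its value or the name of the exception it raised. Python turns several of the interpretations above into exceptions rather than NaN.

```python
import math

def outcome(thunk):
    """Run a zero-argument callable; return its value or the exception's type name."""
    try:
        return thunk()
    except Exception as exc:
        return type(exc).__name__

# B. Not numeric at all
print(outcome(lambda: float("purple")))      # ValueError
# D. Numerically indefinite
print(outcome(lambda: 0.0 / 0.0))            # ZeroDivisionError
print(outcome(lambda: math.inf - math.inf))  # nan
# E. The result of an invalid operation
print(outcome(lambda: math.sqrt(-1.0)))      # ValueError
```

Interpretations A and C have no arithmetic trigger at all: a missing or inapplicable value only becomes NaN because someone decided to encode it that way.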
Footnotes
[^1]: Okay, not a language but still significant.
[^5]: https://cran.r-project.org/doc/manuals/r-release/R-lang.html#NA-handling
[^6]: https://www.haskell.org/onlinereport/haskell2010/haskellch21.html#x29-25500021