To whom it may concern at the Tech Support Team of StataCorp.

During practical work I encountered the following behavior of Stata 11.0, which in my opinion is a bug.

A large data file benin06.dta (50MB in size) was (for completely unrelated reasons) split across a set of
1.44MB floppies. Each chunk of the file was 1,457,664 bytes long except the last one. In the bug
replication log the first chunk is denoted as benin06one.dta.

A person not familiar with the procedure (combining the files back into the single file) started working
with the first chunk only. After loading the chunk into Stata:

-- Stata didn't complain that the file is incomplete: no error message, warning, confirmation, etc.;
-- Stata reports the same number of observations (_N) as in the complete file;
-- Stata reports "plausible" values in the data points beyond those actually read; using other tools,
I determined that only about 2,460 observations were located in the first chunk of the file. I have
created a bug-replication script that tabulates the values of one variable for n>30,000 to see what's
there;
-- The original file contained value labels; these were silently removed.

The absence of a check of the data actually read against the information stored in the header creates a
situation that is rather non-trivial to troubleshoot. Simple checks like "are you using file Benin06.dta?",
"is the number of variables 459?", "is the number of observations 90,650?" do not help here
(only the disappearance of the value labels gave a hint).

I would understand if the rest of the memory that was allocated, but not populated from the file, contained
garbage, but the fact that the values are session-independent (checked by launching several Stata
sessions in parallel) and "plausible" remains a complete mystery to me. Does Stata do anything to
"impute" the values that can't be read from the disk? Will it also do something if, say, a disk sector
fails? (This would be very dangerous; I hope the OS would declare the whole file inaccessible then.)

A side question is whether anything similar may happen with aborted downloads. And if the copy A B
succeeded, may I be sure that B is identical to A?
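
Outside Stata, such a copy can be verified independently by comparing the two files byte by byte.
A minimal Python sketch (copies_match is an illustrative name, not an existing tool):

```python
import filecmp


def copies_match(path_a, path_b):
    """Return True only if both files have identical contents."""
    # shallow=False forces a comparison of the file contents instead of
    # trusting file size and modification time alone.
    return filecmp.cmp(path_a, path_b, shallow=False)
```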

I've managed to reproduce the problem with other datasets, so the particular content of the dataset
Benin06.dta is irrelevant here.

May I request that Stata check the amount of data actually read from the file against the amount of
data declared in the file header and abort with an error if these do not match? In a more complete
solution I envision an option ",incompleteok" or ",repair" to salvage whatever data is available (with a
warning).
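
To illustrate the kind of check I have in mind, here is a minimal sketch in Python; it is not Stata's
actual implementation. It assumes that the caller has already obtained from the file header the byte
offset at which the data area starts, the number of observations, and the width of one observation in
bytes. These three parameters and the function name are hypothetical.

```python
import os


def check_declared_size(path, data_offset, n_obs, obs_width):
    """Raise ValueError if the file is too short to hold the data
    that its header declares; otherwise return the size on disk.

    data_offset, n_obs and obs_width are assumed to have been taken
    from the file header by the caller (hypothetical parameters).
    """
    declared = data_offset + n_obs * obs_width  # minimum size implied by header
    actual = os.path.getsize(path)              # bytes really present on disk
    if actual < declared:
        raise ValueError(
            f"{path}: header declares at least {declared} bytes, "
            f"but only {actual} bytes are on disk"
        )
    return actual
```

With an option like ",incompleteok", the same quantities would give the number of complete
observations that can be salvaged: (actual - data_offset) // obs_width, provided the file is at least
data_offset bytes long.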

I am familiar with data archiving, checksums and other possible solutions here, but the one I suggest
is obvious, trivial, and does not require any preparatory work by the user (such as computing
checksums).

Thank you very much, Sergiy Radyakin


Attached are two files: bug.txt and bug.do