PACER is a public system for access to electronic court records. Wholesale downloads of files from it, initiated by Carl Malamud in the RECAP project, have provided insight into the systematic ways in which courts and parties to litigation fail to redact documents properly.
A study of redaction failures gives a sense of how widespread improper redaction is in court documents. The specific failure it examines is the "redaction rectangle" failure, where a black box is drawn over text in such a way that the underlying text remains in the document. Quoting from author Timothy B. Lee, emphasis added:
So how many PACER documents have this problem? We're in a good position to study this question because we have a large collection of PACER documents—1.8 million of them when I started my research last year. I wrote software to detect redaction rectangles—it turns out these are relatively easy to recognize based on their color, shape, and the specific commands used to draw them. Out of 1.8 million PACER documents, there were approximately 2000 documents with redaction rectangles. (There were also about 3500 documents that were redacted by replacing text by strings of Xes, I also excluded documents that were redacted by Carl Malamud before he donated them to our archive.)
Next, my software checked to see if these redaction rectangles overlapped with text. My software identified a few hundred documents that appeared to have text under redaction rectangles, and examining them by hand revealed 194 documents with failed redactions. The majority of the documents (about 130) appear [to] be from commercial litigation, in which parties have unsuccessfully attempted to redact trade secrets such as sales figures and confidential product information. Other improperly redacted documents contain sensitive medical information, addresses, and dates of birth. Still others contain the names of witnesses, jurors, plaintiffs, and one minor.
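The overlap test described in the quote can be sketched in a few lines. This is not Lee's software, only a minimal illustration of the idea. It assumes that some PDF library has already extracted the filled rectangles and the bounding boxes of each word on a page. The `Box` type, the `hidden_text` function, the `min_cover` threshold, and the sample coordinates are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned box in page coordinates (hypothetical helper type)."""
    x0: float
    y0: float
    x1: float
    y1: float

    def area(self) -> float:
        return max(0.0, self.x1 - self.x0) * max(0.0, self.y1 - self.y0)


def overlap_area(a: Box, b: Box) -> float:
    """Area of the intersection of two boxes, or 0.0 if they do not overlap."""
    w = min(a.x1, b.x1) - max(a.x0, b.x0)
    h = min(a.y1, b.y1) - max(a.y0, b.y0)
    return w * h if w > 0 and h > 0 else 0.0


def hidden_text(rects, words, min_cover=0.5):
    """Return the words whose box is at least min_cover covered by a rectangle.

    rects: list of Box, the candidate redaction rectangles on one page.
    words: list of (text, Box) pairs still present in the page's text layer.
    """
    flagged = []
    for text, box in words:
        if box.area() == 0:
            continue  # skip degenerate boxes to avoid dividing by zero
        if any(overlap_area(r, box) / box.area() >= min_cover for r in rects):
            flagged.append(text)
    return flagged


# Made-up sample page: one black rectangle drawn over part of a line of text.
rects = [Box(100, 700, 300, 715)]
words = [
    ("Revenue:", Box(40, 702, 95, 713)),
    ("$4.2M", Box(105, 702, 150, 713)),
    ("confidential", Box(160, 702, 240, 713)),
]
print(hidden_text(rects, words))  # ['$4.2M', 'confidential']
```

Any word this returns is text that a reader could still copy out from under the black box, so a non-empty result marks the document for the kind of manual review Lee describes.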
One document in 10,000 (194 failures out of 1.8 million documents) sounds like a needle in a haystack. But a sample document can be checked automatically and flagged as a possible redaction failure, and about 1 in 10 of the roughly 2,000 documents with redaction rectangles turned out to contain a real failure, which makes this a fruitful area for further study. At the very least, there's the opportunity to flag a document before releasing it, or to warn PDF creators automatically that their workflow includes a step that shouldn't be there.
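The two rates above follow directly from the figures in the quote. Here is a quick check, treating "approximately 2000" as exactly 2,000 for the sake of the arithmetic (the variable names are mine):

```python
total_docs = 1_800_000       # PACER documents in the archive
with_rectangles = 2_000      # "approximately 2000" in the quote; rounded here
confirmed_failures = 194     # failed redactions confirmed by hand

overall = total_docs / confirmed_failures             # documents per failure
among_flagged = with_rectangles / confirmed_failures  # rectangle documents per failure

print(f"about 1 in {overall:,.0f} documents overall")
print(f"about 1 in {among_flagged:.1f} documents with redaction rectangles")
# about 1 in 9,278 documents overall
# about 1 in 10.3 documents with redaction rectangles
```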
For more on this, I recommend the text "Public Access to Court Records - Protecting Personal Sensitive Information," delivered as an ABA webinar on the topic in March 2011.