Nystrom, Eric C.; Tanenhaus, David S.
This dataset was constructed to connect the rich metadata created by the Supreme Court Database (SCDB) to the Caselaw Access Project (CAP) full-text court opinion data. Since the SCDB includes only substantive opinions, it is necessarily a subset of the full range of opinions available through CAP.
There are two parts to this data: the map connecting each SCDB ID to its corresponding CAP case number, and a more advanced (but error-prone) version in which the authorship of each opinion text identified for the case in CAP is attributed to the Justice who wrote it. Each of these data products has been hand-corrected to the best of this author's ability.
The SCDB->CAP map began as a relatively straightforward automated matching process, based on the US Reports citation for each case as expressed in both SCDB and CAP. Slightly over 80% of SCDB entries found a single CAP data match this way. From there, the data was entirely hand-corrected, with non-matches or duplicate matches individually investigated and manually corrected.
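The automated first pass described above can be sketched roughly as follows. This is a minimal illustration, not the procedure actually used: the function names, the citation normalization rule, and the sample IDs are all assumptions made for the example.

```python
import re
from collections import defaultdict

def normalize_cite(cite):
    # Assumed normalization: drop periods and whitespace, uppercase,
    # so "347 U. S. 483" and "347 US 483" compare equal.
    return re.sub(r"[.\s]", "", cite).upper()

def match_by_citation(scdb_rows, cap_rows):
    """Sort SCDB entries into single matches, non-matches, and duplicate matches.

    Both inputs are iterables of (id, us_reports_citation) pairs.
    """
    cap_by_cite = defaultdict(list)
    for cap_id, cite in cap_rows:
        cap_by_cite[normalize_cite(cite)].append(cap_id)

    matched, unmatched, ambiguous = {}, [], {}
    for scdb_id, cite in scdb_rows:
        hits = cap_by_cite.get(normalize_cite(cite), [])
        if len(hits) == 1:
            matched[scdb_id] = hits[0]    # exactly one CAP case: accept
        elif not hits:
            unmatched.append(scdb_id)     # needs manual investigation
        else:
            ambiguous[scdb_id] = hits     # several CAP cases share the cite
    return matched, unmatched, ambiguous

# Placeholder IDs, invented for illustration only.
scdb = [("S1", "347 U.S. 483"), ("S2", "1 U.S. 1"), ("S3", "5 U. S. 137")]
cap = [(101, "347 US 483"), (102, "5 U.S. 137"), (103, "5 U.S. 137")]
matched, unmatched, ambiguous = match_by_citation(scdb, cap)
```

Only the `matched` entries would be accepted automatically; the other two groups correspond to the non-matches and duplicate matches that were investigated by hand.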
Some SCDB entries simply could not be matched to an appropriate CAP text. Initially, the entirety of US Reports volume 44 was missing, but with the help of CAP staff, the volume was located as having been filed in the New York jurisdiction rather than the United States jurisdiction. The case numbers were then added to the map, but until the volume is relocated to the United States jurisdiction, it may be necessary to also incorporate the New York jurisdiction in full text analysis so that the cases from volume 44 can be searched. Another 108 missing cases are from US Reports volume 131, which was a "catch up" volume published in the 19th century. These catch-up cases, many heard by the Supreme Court decades prior, were numbered with lowercase roman numerals instead of the ordinary numbers, which is almost certainly why CAP's software dismissed the catch-up section as prefatory material. Most of the remaining non-matches seem to be examples where the SCDB project recognized a separate court action that CAP did not. Most of these, in turn, seem to have been later rehearings for a case previously decided, which in the 19th century particularly were commonly reported out at the end of the first decision text. While SCDB sometimes gave these subsequent but related actions a separate SCDB entry, CAP seems to have largely incorporated them as part of the text of the main case. Additionally, there were a few that simply could not be found, despite a careful look through each database as well as the original US Reports and sometimes adjacent volumes. Finally, the cases were only matched up through the 2011 court term. After the 2011 term, the mismatches between CAP and SCDB were extensive and frequently seemed impossible to resolve.
Even so, with the manual correction, the overall error rate is low. Of 28,304 cases, only 191 do not have a match, and of those, 108 are contained within the vol. 131 "catch up" volume. Since most of the rest are extremely short subsequent actions that were separately noted by SCDB, the effect of these non-matched cases would seem to be small in most cases.
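For reference, the figures quoted above work out as follows. The snippet uses only the three counts given in the text; the variable names are illustrative.

```python
total_cases = 28_304   # SCDB cases considered
unmatched = 191        # cases with no CAP match
in_vol_131 = 108       # unmatched cases in the vol. 131 "catch up" volume

other_unmatched = unmatched - in_vol_131      # unmatched cases outside vol. 131
unmatched_rate = unmatched / total_cases
other_rate = other_unmatched / total_cases

print(other_unmatched)            # 83
print(f"{unmatched_rate:.2%}")    # 0.67%
print(f"{other_rate:.2%}")        # 0.29%
```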
The typical use case would be that the researcher would generate some kind of results based on searching in the CAP full text, then could use the CAP ID to look up the SCDB ID in the map. With the SCDB ID, of course, the rich metadata from the SCDB can then be connected to each result as needed.
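A minimal sketch of that lookup, using only the Python standard library, might look like the following. It assumes the column order documented for the map file (SCDB ID first, CAP ID second); whether the file has a header row is not stated here, so `has_header` is an assumption to adjust, and the sample rows are invented placeholders rather than real IDs.

```python
import csv
import io

def load_cap_to_scdb(tsv_file, has_header=True):
    """Build a {CAP ID: SCDB ID} dict from the tab-separated map file."""
    # QUOTE_NONE keeps quotation marks in case names from confusing the parser.
    reader = csv.reader(tsv_file, delimiter="\t", quoting=csv.QUOTE_NONE)
    if has_header:
        next(reader, None)   # skip the assumed header row
    mapping = {}
    for row in reader:
        if len(row) < 2:
            continue         # skip blank or malformed lines
        scdb_id, cap_id = row[0], row[1]
        mapping[cap_id] = scdb_id
    return mapping

# Placeholder content standing in for the real map file.
sample = io.StringIO(
    "scdb_id\tcap_id\tus_cite\tdate\tcase_name\n"
    "SCDB-ID-1\tCAP-ID-1\t1 U.S. 1\t1/1/1800\tExample v. Example\n"
    "SCDB-ID-2\tCAP-ID-2\t1 U.S. 2\t1/2/1800\tSample v. Sample\n"
)
cap_to_scdb = load_cap_to_scdb(sample)

# CAP IDs returned by a full-text search (hypothetical values).
search_hits = ["CAP-ID-2", "CAP-ID-9"]
scdb_ids = [cap_to_scdb.get(cap_id) for cap_id in search_hits]
```

A `None` in `scdb_ids` marks a CAP case with no SCDB entry in the map, which is expected given that SCDB covers only a subset of CAP.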
Being able to use the rich metadata of SCDB in conjunction with a case's full text is exciting, but it immediately prompts a further question -- what if the texts could be attributed directly to the Justices who authored them? SCDB produces its data in two forms; one is "case centered," where each record represents one case, and the other is "justice centered," in which each record is the vote of one Justice in one case. CAP, in turn, breaks the total text of the case into distinct opinions, and tries to attribute those opinions to their authors by scraping a string of text from the raw input. Therefore, the challenge was to connect these two sources at the opinion level.
Connecting the opinions, like connecting the cases, involved an initial match by machines, followed by manual correction and revision. In this case, the scope of the manual effort was much larger than that posed by the case-level connection, and more errors were noted in both SCDB and CAP.
The matching process involved a number of steps. First a list of opinions was generated from the CAP data, then matched to SCDB using the SCDB-CAP connector data described above. (Thus, a case without a CAP match in the SCDB-CAP data will not appear in the opinion author data either.) CAP opinions were numbered in the order they were encountered in each CAP case JSON object, and these numbers are used to distinguish the opinions.
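The numbering step can be illustrated with a short sketch. The JSON layout assumed here (an `opinions` list under `casebody`, or under `casebody` then `data`, with `author` and `type` keys on each opinion) is an assumption about the CAP export format and should be checked against the files actually in use.

```python
def number_opinions(cap_case):
    """Return (opinion_number, author_string, opinion_type) tuples for one case.

    Numbers start at 1 and follow the order of the opinions in the JSON object.
    """
    casebody = cap_case.get("casebody") or {}
    # Assumed layouts: casebody["data"]["opinions"] or casebody["opinions"].
    body = casebody.get("data") or casebody
    opinions = body.get("opinions", [])
    return [
        (number, opinion.get("author"), opinion.get("type"))
        for number, opinion in enumerate(opinions, start=1)
    ]

# Invented example object, not real CAP data.
example_case = {
    "id": "CAP-ID-1",
    "casebody": {
        "opinions": [
            {"type": "majority", "author": "Mr. Justice Example", "text": "..."},
            {"type": "dissent", "author": "Mr. Justice Sample", "text": "..."},
        ]
    },
}
numbered = number_opinions(example_case)
```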
Next, a round of automatic matching was performed. If there was only one opinion, and only one author listed in the SCDB data, then the majority opinion author (as listed in SCDB) was safely assumed to be the author. If there was no author listed in SCDB, "percuriam" was recorded as the author in this data. If there were exactly two opinions and two authors, the process was also straightforward, as the SCDB-identified majority opinion author was assigned to opinion 1, and the remaining author assigned opinion 2.
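These rules are simple enough to state as code. The sketch below is one reading of them, with hypothetical function and parameter names; in particular, it assumes the "percuriam" rule applies to single-opinion cases with no SCDB author, which the description above does not spell out.

```python
def assign_simple_authors(n_opinions, majority_author, other_authors):
    """Apply the one-opinion and two-opinion rules.

    Returns {opinion_number: author}, or None when the case must be left
    for the later fuzzy-matching and manual steps.
    """
    others = list(other_authors)
    n_authors = len(others) + (1 if majority_author else 0)

    if n_opinions == 1 and n_authors == 0:
        return {1: "percuriam"}                      # no author listed in SCDB
    if n_opinions == 1 and n_authors == 1 and majority_author:
        return {1: majority_author}                  # lone opinion, lone author
    if n_opinions == 2 and n_authors == 2 and majority_author:
        return {1: majority_author, 2: others[0]}    # majority first, then the other
    return None

print(assign_simple_authors(1, "Warren", []))
print(assign_simple_authors(2, "Black", ["Clark"]))
print(assign_simple_authors(3, "Black", ["Clark", "Harlan"]))
```

The third call returns `None`, which stands for a case that is passed on to the later matching steps.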
Subsequently, cases with more than two opinions were processed. A potential match (i.e. a "guess") for each opinion in a given case was created by listing each Justice identified by SCDB as having written an opinion in the case. These guesses were then parsed using a semi-automatic procedure with Levenshtein distance fuzzy name matching. With sufficiently conservative parameters, a successful fuzzy match meant that the non-successful guesses for that opinion could be deleted. These sorted guesses were then reviewed manually. Particular care was also taken for any case that contained authored opinions by Justices who had similar names (for example, Clark and Black differ by only two letters). These sorts of cases, as well as instances of co-authorship, were identified and fixed manually.
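To make the fuzzy-matching idea concrete, here is a small self-contained sketch. It is not the procedure actually used: the tokenization, the `max_distance=1` threshold, and the rule of accepting a guess only when exactly one candidate is close are all assumptions chosen to illustrate what "conservative parameters" could mean.

```python
import re

def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

def fuzzy_author(cap_author, candidates, max_distance=1):
    """Pick the one candidate surname close to a word in the CAP author string.

    Returns None when no candidate, or more than one, is within max_distance,
    leaving the opinion for manual review.
    """
    tokens = re.findall(r"[a-z]+", cap_author.lower())
    if not tokens:
        return None
    close = [
        name for name in candidates
        if min(levenshtein(name.lower(), tok) for tok in tokens) <= max_distance
    ]
    return close[0] if len(close) == 1 else None

candidates = ["Clark", "Black"]
clean = fuzzy_author("Mr. Justice Clark", candidates)   # exact surname present
noisy = fuzzy_author("Mr. Justice Clcrk", candidates)   # one OCR error
unclear = fuzzy_author("Mr. Justice Blark", candidates) # close to both names
```

With these settings the first two calls resolve to one Justice, while the third returns `None` because both surnames are within the threshold, which is the kind of case that calls for manual review.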
Those opinions whose authorship could not be matched then were fixed by hand. These included some where the CAP author strings were more complicated than SCDB's strict interpretation; others where the OCR in CAP which contained the Justice name was especially bad; and a number of others where "Mr. Chief Justice" couldn't be directly matched with an author name by the machine. After this light manual correction, almost 500 opinions with substantial errors remained to be individually investigated in depth, by examining the CAP record, the SCDB record, and images of the US Reports for that case. For these last tough customers, errors in the source data were commonly the cause of matching problems. Typically these were of three kinds: examples where CAP should have split the text but didn't (e.g. 2 opinions together in one opinion entry in CAP); examples where SCDB either did not identify or mis-identified an author (such as attributing it to Swayne when it was written by Miller); and examples of non-valid opinions (such as where CAP mistakenly split the opinion too early, leaving an opinion fragment).
For these errors, a system of codes was created in the author field to signal the error type so that researchers can be suitably cautious. The error code is always at the beginning of the field and is followed by a comma and the names of each author, separated by a comma with no space to facilitate parsing. Note also that co-authors are listed as comma-separated names in this same field with no error code. Researchers will probably want to disaggregate this field to create duplicate records with each individual author for most purposes. The justice number field also contains information about all justices authoring the opinion but the error codes have been omitted here.
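Disaggregating the author field might be sketched as below. The actual error code values are not listed in this description, so the function takes the set of codes as a parameter, and `ERRCODE` in the example is a made-up placeholder, not a real code from the data.

```python
def split_author_field(field, error_codes):
    """Split an author field into (error_code_or_None, [author, ...])."""
    parts = [part for part in field.split(",") if part]
    if parts and parts[0] in error_codes:
        return parts[0], parts[1:]
    return None, parts

def disaggregate(record, error_codes, author_key="author"):
    """Yield one copy of the record per individual author.

    The error code, if any, is moved to a separate "error_code" key.
    """
    code, authors = split_author_field(record[author_key], error_codes)
    for author in authors:
        row = dict(record)
        row[author_key] = author
        row["error_code"] = code
        yield row

# "ERRCODE" is a hypothetical stand-in for whatever codes the file uses.
codes = {"ERRCODE"}
plain = split_author_field("Brennan,Marshall", codes)
flagged = split_author_field("ERRCODE,Miller", codes)
rows = list(disaggregate({"cap_id": "CAP-ID-1", "author": "Brennan,Marshall"}, codes))
```

Keeping the code in its own key, rather than discarding it, lets later analysis filter or flag the affected opinions.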
Data file structure
"scdb_cap-051820.tsv" is a Tab-separated data file containing 5 columns: SCDB ID, CAP ID, US Reports citation, case date, and case name (the latter three from the SCDB data).
"scdb-cap-opinion-authorship_051920.tsv" is a Tab-separated data file containing seven columns: SCDB ID, CAP ID, US Reports citation, case name, opinion number in the case, opinion author, and SCDB justice ID. See above for caveats about disaggregating and error codes in fields six and seven.
It is likely that errors remain in this data, and it is also hoped that some of the errors beyond the author's immediate control might be fixed in the upstream data so that they can be corrected here. The authors would be grateful for error reports, and also for reports of errors fixed, if any.