Duplicate Handling

How we identify, deduplicate, and present overlapping audit records from multiple sources

How Deduplication Works

Whenever we import a record, we create a fingerprint for it—an MD5 hash of the complete record. If a future import generates the same fingerprint, we don’t store the record again. Instead, we track the fact that it’s available from more than one source.

Whenever you see “2(+) sources” in a result, that’s what happened: two or more agencies released audit data containing an identical record.
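As a rough illustration of the fingerprint-and-track behavior described above, here is a minimal sketch in Python. It assumes each record is available as a single text string encoded as UTF-8; the `RecordStore` class, its method names, and the agency names are hypothetical and are not taken from our actual import code.

```python
import hashlib
from collections import defaultdict


def fingerprint(record_text: str) -> str:
    """Return the MD5 hex digest of the record's exact text (assumed UTF-8)."""
    return hashlib.md5(record_text.encode("utf-8")).hexdigest()


class RecordStore:
    """Hypothetical in-memory store: one copy per fingerprint, many sources."""

    def __init__(self) -> None:
        self.records: dict[str, str] = {}                     # fingerprint -> record text
        self.sources: dict[str, set[str]] = defaultdict(set)  # fingerprint -> source names

    def import_record(self, record_text: str, source: str) -> str:
        fp = fingerprint(record_text)
        # Store the record only the first time this fingerprint is seen.
        if fp not in self.records:
            self.records[fp] = record_text
        # Always note which source supplied it.
        self.sources[fp].add(source)
        return fp

    def source_count(self, fp: str) -> int:
        return len(self.sources[fp])


store = RecordStore()
first = store.import_record("example record text", "Agency A")
second = store.import_record("example record text", "Agency B")
print(first == second, len(store.records), store.source_count(first))
```

In this sketch the second import adds no new record; it only raises the source count for the existing fingerprint, which is what a “2(+) sources” label would reflect.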

Expected vs. Actual Overlap

You would expect every search of the national network to appear in every participating agency’s audit log. That’s what Flock claims, and that’s what our system was designed to accommodate.

In practice, this doesn’t happen. Take June 2025 as an example: there were 1,907,643 unique fingerprints for that month, but the largest number of records from any single source was 382,998. This means either that no single source shows all national network searches, or that sources show different information about the same searches. Both appear to be true.
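To make the size of that gap concrete, the short calculation below uses only the two June 2025 figures quoted above. The variable names are ours, chosen for illustration.

```python
unique_fingerprints = 1_907_643    # unique fingerprints for June 2025
largest_single_source = 382_998    # most records from any one source

# Share of all unique fingerprints that the largest source could account for.
coverage = largest_single_source / unique_fingerprints

# Fingerprints that cannot have come from that source alone.
not_in_largest = unique_fingerprints - largest_single_source

print(f"{coverage:.1%}")      # 20.1%
print(f"{not_in_largest:,}")  # 1,524,645
```

Even the most complete source accounts for only about a fifth of the unique fingerprints seen that month.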

Partial and Contradictory Records

Because our fingerprint is based on the exact text of the record, changing even one letter changes the fingerprint. This is the technically correct way to treat audit records: any alteration, however small, produces a different fingerprint and so becomes evident as soon as the records are compared.

However, when Agency A reports a search with the reason “Investigation” and Agency B shortens that reason to “Inv” or blanks it out, our system sees two distinct fingerprints. It treats them as two separate searches, even though they likely represent the same event.
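The effect is easy to reproduce. The sketch below hashes three versions of one search that differ only in the reason field. The pipe-delimited layout, the timestamp, and the organization name are invented for illustration; real audit records have a different format.

```python
import hashlib


def fingerprint(record_text: str) -> str:
    """MD5 hex digest of the exact record text (assumed UTF-8)."""
    return hashlib.md5(record_text.encode("utf-8")).hexdigest()


# Hypothetical records: same time and organization, different reason text.
full_reason = "2025-06-01T12:00:00Z|Example PD|Investigation"
short_reason = "2025-06-01T12:00:00Z|Example PD|Inv"
blank_reason = "2025-06-01T12:00:00Z|Example PD|"

fingerprints = {fingerprint(r) for r in (full_reason, short_reason, blank_reason)}

# Three different texts give three different fingerprints,
# so this one event would be counted as three searches.
print(len(fingerprints))  # 3
```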

We also regularly see searches logged differently between organizations. These variations range from small changes in organization names (“University of Iowa PD IA” vs. “University of Iowa (IA) PD”) to completely different operators or organizations.

Our Approach

When faced with records that may or may not be duplicates, there are three options:

  1. Discard the record entirely.
  2. Try to determine if it’s a duplicate based on available information—a task that is already virtually impossible even without redactions due to the unreliability of the data.
  3. Assume the record shows what it purports to show: someone did something at some time for some reason.

To maximize the utility of available data, we rejected option 1. To minimize the chance of introducing additional errors, we did not select option 2.

We chose option 3: present the data as-is, treating each record as a distinct search because it could be one.

Trade-offs

This approach can lead to both over- and underreporting. But redacted data is still valuable. Even when agencies hide operator names or reasons, the record proves that surveillance occurred—a specific vehicle was searched at a specific time by a specific organization.

When a license plate is hidden, we may not be able to detect stalking, but the stated reason may still indicate improper access. When a case number is missing, it suggests the agency is uninterested in relating searches to legitimate investigations.

Check the source statistics page to see which agencies provide complete data and which choose to redact key fields.