Study: PDF, HTML files dominate Data.gov
Data.gov relies heavily on HTML and PDF file formats, leading two George Mason University researchers to ask whether the federal government's data repository is achieving what it set out to accomplish.
In a paper published in April, Anne Washington and David Morar of George Mason's School of Policy, Government and International Affairs combed through the entire Data.gov catalog to determine what file formats were available and whether those formats best serve the intended audience.
The researchers measured the files hosted on Data.gov against the five-star open data scheme advocated by internet pioneer Tim Berners-Lee. That system places data published as PDF or HTML at the lower end of the scale and reserves four and five stars for data formats that can be linked together, much as URLs are hyperlinked across the internet.
Washington and Morar modified the five-star structure to account for the number of files on Data.gov posted in obscure formats. Those files, typically formats used in word-processing or mapping programs, were given zero stars. Unstructured formats, such as HTML or PDF, received one star. Proprietary formats, such as Microsoft Word or Excel files, were given two stars. Structured, machine-readable formats, such as XML or CSV files, were given three stars. Files that contain uniform resource identifiers were ranked highest, with four stars.
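For readers who want a concrete sense of how such a classification works, here is a minimal sketch in Python. The format-to-star mapping below is an abbreviated, illustrative assumption based on the categories described above, not the researchers' full classification, and the function name rate_format is hypothetical.

# Sketch of the modified star scale described above.
# The format lists are illustrative assumptions, not the authors' full mapping.
STAR_RATINGS = {
    0: {"obscure"},                      # niche word-processing or mapping formats
    1: {"html", "pdf"},                  # unstructured formats
    2: {"doc", "docx", "xls", "xlsx"},   # proprietary formats
    3: {"xml", "csv"},                   # structured, machine-readable formats
    4: {"rdf"},                          # formats carrying uniform resource identifiers
}

def rate_format(file_format: str) -> int:
    """Return the star rating for a file format under the modified scale."""
    fmt = file_format.lower()
    for stars, formats in STAR_RATINGS.items():
        if fmt in formats:
            return stars
    return 0  # unrecognized or obscure formats default to zero stars

if __name__ == "__main__":
    for fmt in ["HTML", "PDF", "CSV", "RDF"]:
        print(fmt, "->", rate_format(fmt), "stars")

Running the script simply prints a star rating for each sample format, mirroring how each file in the catalog could be bucketed before tallying the distribution reported below.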
The researchers found that of the roughly 244,000 files on Data.gov, more than 30 percent (77,217) were posted in HTML. XML was the second-most common format, at 17 percent (42,846). PDFs came in third at 14 percent (34,381), while two lesser-known formats, ODF and Octet Stream, rounded out the top five.
More than 60 percent of Data.gov’s files were given a one-star rating. Formats that earned three stars — meaning the files are open and machine-readable — finished second, with 23 percent of all Data.gov files falling into this category.
Only 18,347 files — 7 percent — were found to meet the four-star criteria.
The study’s authors found that agencies have embraced publishing information to Data.gov in formats that a wide swath of the public can use. However, the study points out that the government may be more focused on informing the “English-literate public than the data literate who want machine-readable information.”
“If the goal of open government data is machine readable structured file, there may be a legitimate concern about the large number of PDF and HTML files,” the report reads. “The innovators and the data entrepreneurs expect structure machine-readable data.”
Congress is pushing for machine-readable data to be the government’s default format. In April, groups in the House and Senate introduced a bill that calls on agencies to create an inventory of all enterprise data, determine what can be released publicly, and post it with open licenses and in machine-readable formats.
The authors also conclude that the government will have to decide how to reach average users and technical audiences alike.
“Governments attempt to satisfy both the average user, with simple accessible formats, and the sophisticated data consumer, with structured machine-readable formats,” the report reads. “Open government data has established an important pattern of considering both the least and the most sophisticated users. This study suggests that we need a broader conversation about who the data audience will be in the context of open government.”
You can download the full study here.