Lumen Researcher Interview Series: Professor Daniel Seng
In the second part of the Lumen Researcher interview series, we spoke with Daniel Seng, Associate Professor at the National University of Singapore (NUS) and Director of NUS’s Centre for Technology, Robotics, Artificial Intelligence and the Law.
Professor Seng pursued his doctoral degree at Stanford Law School, where he used machine learning, natural language processing, and big data techniques to conduct research on copyright takedown notices. His doctoral thesis was the beginning of his decade-long association with the Lumen Project, then known as Chilling Effects.
As part of his research, Professor Seng has worked on five papers that use the data in the Lumen Database. The bulk of his focus has been on empirical and quantitative legal analysis, for which he has analyzed 9.4 million takedown notices comprising 203 million complaints and 5.66 billion URLs in the Database. In this interview, Adam Holland, Project Manager at the Lumen Project, and Shreya Tewari, Lumen’s 2021 Research Fellow, spoke with Professor Seng about his research work, his research methodology, and how the Lumen Database catalyzed his research.
Shreya Tewari: Can you tell us about your articles like “Who Watches the Watchmen”, “The State of the Discordant Union” and others where you’ve used Lumen in your research?
Daniel Seng: Sure. I actually have two other papers that talk about Lumen. They are in books, so they’re not as easily accessible. One is called “Information Science, Intellectual Property and Quantitative Legal Analysis,” published in the Oxford Handbook on IP Research, where I explain the quantitative legal analysis of technology, [for example] with reference to the Lumen database. The other is “Big Data and Copyright Law” in the Research Handbook on Big Data Law, edited by Dr. Roland Vogl of Stanford University. It has just been published. There, I explain how we can do empirical data research using copyright databases. Obviously, there are very, very few such databases around.
So Lumen features prominently in a lot of my writings, in addition to the first three papers. That trilogy of papers is actually part of a set that I submitted for my doctoral thesis at Stanford University. They form a progression, starting with a description and survey of the takedown notice mechanism. [The first] cites the Chilling Effects database, which is the predecessor to Lumen.
Then there is the “Copyrighting Copywrongs” paper, which initially went by a different name; I had to substantially redo and update all the references when Chilling Effects became Lumen [in 2014]. I also had to expand the coverage of the data set from 2008 through 2012, in the initial version, all the way to 2015 in the currently published version, adding a lot more notices and analysis.
The last one is the “Who Watches the Watchmen” paper. It’s a work in progress. I have to admit that this is probably the most difficult paper of all because it involves quite a lot of machine learning. This one talks about how intermediaries actually process copyright takedown notices and the mechanisms that make takedowns work. So it’s related to my research in artificial intelligence and AI systems, because this is a paper that researches who builds AI-driven systems for processing takedown notices.
So, in a nutshell, that’s the trilogy.
Adam Holland: How did you begin on this line of research? What motivated you to look for the takedown notices in these instances, and what led you to your research methodology?
Daniel Seng:
I felt there was an incomplete picture of the mechanics behind the takedown notice mechanism. . . Having taught information technology and law for a number of years now, there’s only so much that the law can tell you about what it says, as opposed to how it actually works. And there is a disconnect. For instance, we talk about the mechanisms in the DMCA, the safe harbors. I wanted to answer a simple question: which provision was most relied upon, by whom, and for what purpose? When consulting the literature, I couldn’t find any useful information about some of these very important questions, which I think go to the mechanics of the DMCA. That is very often underappreciated, because the DMCA, in my view, is actually one of the two pieces of American legislation that make the internet what it is today. It’s also not well appreciated that the DMCA has gone viral, because it’s applied in a lot of countries around the world, even in China. There are DMCA-like safe harbors in their laws; they’re not quite the DMCA, but I can see figments and fingerprints of it in their implementations. So the question then is: how do the mechanisms actually work? And what do fellow academics say about this? I found very little out there. And that got me really bothered, because here I was writing all these monographs, but I felt that I was not getting a complete picture of the nuts and bolts behind the DMCA.
For my thesis, I had the idea of doing an empirical survey of the DMCA takedown notices. At that point in time, the largest data set anyone had yet analyzed at Stanford was about 3,000 data points for empirical analysis. The doctoral committee was quite skeptical that I could handle something as large as this because, of course, no human being can look through 12,000 notices and still maintain any level of coherence in their analysis. So I had to do this with a combination of techniques. Even so, the doctoral committee advised me to take a year off and do a proof of concept to establish that this was a viable thesis.
It was quite disruptive, but the one year gave me the chance to dive deep into the mechanics of the takedown mechanism, examine the Chilling Effects database (subsequently Lumen) in far more detail, and work out how to collect and correlate notices in an organized way for my own research. I then spent a lot of time at the engineering school picking up techniques in natural language processing and machine learning to achieve this scale of analysis, essentially building towards a set of tools that would allow me to process the takedown notices.
By 2015, I had analyzed half a million notices, up from the 12,000 in my pilot study. So we called it quits then and I published my thesis. But I really felt dissatisfied because there was so much left to do. So upon my return, I carried on the research. I was just going over the figures: to date, I have analyzed 9.4 million notices, 203 million complaints, and 5.66 billion URLs.
I started this quest of trying to bring greater understanding of the details and mechanics of the takedown process to the entire community, hoping that this will help us better understand its strengths and its weaknesses, and figure out the roles and responsibilities of the players in the ecosystem: what they’re actually doing, what they’re doing right, and what they’re doing wrong. And then eventually, together with the wonderful work that the Harvard team is doing with the Lumen database, to contribute towards an informed policy decision-making process, so we know exactly what they’re doing. If you want to change the laws, you have to find what I call a very elusive balance between regulating to protect a wonderful, rich commercial internet and ensuring that free speech and the exchange of ideas remain unimpeded and free-flowing on the internet.
Adam Holland: Could you talk a little bit more please about your research methodology?
Daniel Seng:
The methodology was challenging, I must say. When I started, I went to the engineering school to pick up machine learning, natural language processing, and statistical analysis. I knew that these were the pieces I needed. But when I asked around as to whether there was any piece of software that could allow me to do all these things at once, I was very quickly rebuffed. As in, “No. We don’t have anything that allows you to do all these things that you want to do.” So it was a learning experience for me at the engineering school at Stanford, because I quickly figured out that while individual tools are available, the integrated tools I wanted for my research were not. So I ended up spending a lot of time developing my own tools. I call it a platform for extracting, parsing, analyzing, tagging, completing, and detecting patterns in the notices. I did a lot of coding myself. It also made me aware of a lot of the cutting-edge developments happening in natural language processing, which I tried to apply as far as possible because of the nature of my work. I threw whatever I learned into the platform that I developed, because I needed these tools to be versatile. It made sense, with this huge data set of notices out there, to be able to rationalize it, to be able to understand it, to be able to use it.
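As a minimal sketch of the kind of notice-processing pipeline described above, the snippet below extracts and tallies URLs from a toy takedown notice. The notice text and field layout here are entirely hypothetical; Professor Seng’s actual platform is far more extensive and incorporates machine learning as well.

```python
import re
from collections import Counter
from urllib.parse import urlparse

# A toy takedown notice; real Lumen notices carry many more fields.
notice_text = """
Copyright claim: Example Works
Infringing URLs:
http://example-pirate.net/movie1
http://example-pirate.net/movie2
http://another-host.org/track.mp3
"""

# Extract every URL claimed to be infringing.
urls = re.findall(r"https?://\S+", notice_text)

# Tally complaints per target domain, a basic step before any
# statistical or machine-learning analysis of notice patterns.
domains = Counter(urlparse(u).netloc for u in urls)

print(f"{len(urls)} URLs across {len(domains)} domains")
for domain, count in domains.most_common():
    print(domain, count)
```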
Shreya Tewari:
What would you have done if the notices had not been available? Would there have been an alternate approach for your research? If so, would that approach have been as effective as the current one?
Daniel Seng:
That question really got me thinking. I haven’t had the chance to watch the recent Marvel movies where they talk about alternative timelines. But if indeed I did have access to an alternate timeline and the Lumen database were not available, I would apply to do my doctoral studies at Harvard, and then maybe get together with Adam, create the database, and by creating the database, work towards my doctoral thesis. Of course, that also means that by now I would not have finished my doctoral work, because I know how much work putting together the database involves. So, yeah, to me, it’s as if I’m talking about a parallel universe that I cannot imagine exists for me, which actually tells you how critical the Lumen database, and of course its predecessor Chilling Effects, has been for me and my research.
The other possibility would be to conduct a selective survey, what we call a snowball survey, of a few interested parties in the area and ask them questions. But, again, my concern is that that would give a distorted view of what is actually taking place in the DMCA system. It’s not that it’s impossible to do; it’s possible, but it would consume a lot of resources. Each approach comes with its strengths and weaknesses. But, in my case, I strongly believe that the quantitative approach has a lot of advantages over a qualitative approach.
Adam Holland: What other features would you like to see added to the Lumen database? How could we make it better for you and for researchers like you to do your research more effectively?
Daniel Seng: A few things come to mind. First, it’s always useful to be able to see the statement of good faith and the electronic signature that are part of any compliant DMCA notice. One of the areas of my interest has been whether those formalities have been observed.
I also think the geolocation detail and IP information associated with each notice would be useful information to have.
Finally, I wonder if it is possible, on a selective basis, to make the unredacted information [in notices] available to some researchers. That can actually come in handy. I can tell you offhand that there have been several cases where I’ve been quite puzzled by notices that have their URLs redacted. For me, that interferes with the analysis of the URL, especially when I want to find out if there are certain characteristics associated with the URL. So that’s the kind of scenario I’m talking about. This would be only on a selective, case-by-case basis, just to verify that the redaction did not remove any material information in the database. Maybe one way forward is to apply two different types of redaction tags to the complaints? That would make it easy for us to determine whether a notice requires further review or further inspection.
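A rough sketch of how that two-tag idea might look in practice; the tag names and record layout below are purely hypothetical and are not part of the actual Lumen schema.

```python
# Hypothetical redaction markers, not part of the real Lumen schema:
# [REDACTED-PRIVACY] would mark routine removals of personal data, while
# [REDACTED-MATERIAL] would flag removals that could affect URL analysis.
notices = [
    {"id": 101, "url": "http://example.com/user/[REDACTED-PRIVACY]/file"},
    {"id": 102, "url": "[REDACTED-MATERIAL]"},
]

# A researcher could then queue only the second kind for closer review.
needs_review = [n["id"] for n in notices if "[REDACTED-MATERIAL]" in n["url"]]
print(needs_review)  # -> [102]
```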
Shreya Tewari: As a final question, what are your thoughts, as a scholar of copyright and intermediary liability, on the importance of transparency through notice sharing? And how does that reflect on policy making and research?
Daniel Seng:
Most certainly. The prime example I always cite about the need for transparency, which goes hand in hand with accountability, is the availability of what [information] makes a takedown complete. I think in this regard Lumen’s achievements are just incredible, because you’ve actually set the path forward for much of the debate that is taking place right now amongst the engineering community regarding accountable AI, explainable AI, and the need for transparency and accountability. What better way to promote accountability than to be transparent about it? To explain, within the limits of considerations like privacy and data protection, the mechanics behind what you are actually doing. So I’m actually surprised that you have been able to get so much cooperation from so many of these internet intermediaries to make this a sustainable project. This would actually be the hallmark of what every intermediary should do if they want to encourage accountability. I wrote in my empirical paper, “The world will have to thank Lumen for this wonderful job in shedding light in this very strange area of copyright protection.” It’s not frequently appreciated, but copyright is a formality-free system, and the Lumen database is its proxy: if you think about it, trademark systems and patent systems require registration, but copyright does not. So what’s the alternative or substitute for that? The Lumen database. So we have to thank the Lumen database for shedding light into this very strange area of intellectual property law that has no registration requirements.