Better thinking through technology

Current state of text-based projects



I don't post very often about text-based machine-learning systems, and since they've been the focus of my recent efforts, that largely means I don't post much at all. One of the big challenges with any sort of text-based machine-learning system is reading and comprehension. One of the reasons that a system like IBM's Watson is considered a big deal is that it can scan a massive pile of texts and pull together all the ideas into a meaningful response.

On the reading side of things, the entire game is about identifying topicality. What the hell is the text even about, and what is the most intelligent summary of the text?

My first attempt at building such a system went vastly better than could be expected. I don't at this time intend to talk much about the secret sauce. Instead I want to take a look at two articles pulled from the internet.


First Example



The first article comes from Tyler Cowen via Bloomberg: How Trump Should Support the Arts.

To fully comprehend what the system can accomplish, you'll need to read the full article. If you're not up for a read, here's what the bot produced as a summary:
Since President-elect Donald Trump has given few clues as to how he will approach arts policy, I thought I would lay out what a Trump administration ought to do. The NEA sends 40 percent of that money to the state arts agencies, in part to shore up its political support across different regions. My first suggestion is to cease such transfers, thereby saving money for arts support at the federal level. Federal-level support for the arts is typically less conservative and more cutting-edge than state-level support, if only because the federal peer-review panels tend to be more sophisticated and local community pressures are weaker. For that reason, reallocating arts support to the federal level could provide more artistic bang for the buck.

Since this change would give the forthcoming Trump administration more control over arts spending, it might have appeal. That’s not sufficient reason to have autocracy, nor am I suggesting that the Trump administration will bring autocracy to the U.S. Nonetheless, tolerating some arbitrary political control over arts spending, at the expense of regular bureaucratic control, could be a plus rather than a negative, especially over successive administrations. Whatever problems you might have with the idea of a Trump administration, it opens up the prospect of a real improvement in American arts policy.
The summaries are typically from a tenth to a third of the length of the original articles.

What I'm happiest with is how lean this version reads. All the fat is trimmed out, and there's nothing left except the meat of the article.

FTR, here's how the topicality of the article has been scored . . .

arts 779
administration 138
support 133
federal 90
artists 81
Trump 44
control 36
money 36
NEA 30
grants 24

I think that's a pretty fair assessment.

973.75 is the target score it's looking for in any given sentence for inclusion. This is a really simple number. It's just the maximum topic score plus 25 percent of that figure: 779 + 194.75 = 973.75. Basically, it'll favor anything with the primary topic plus one or two of the next strongest words.
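As a quick sanity check on that arithmetic, here's a minimal sketch of the threshold calculation. The function name and the `bonus` parameter are made up for illustration; only the scores and the 25 percent rule come from the numbers above.

```python
# Topic scores as listed above for the Cowen article.
topic_scores = {
    "arts": 779, "administration": 138, "support": 133, "federal": 90,
    "artists": 81, "Trump": 44, "control": 36, "money": 36, "NEA": 30,
    "grants": 24,
}

def inclusion_threshold(scores, bonus=0.25):
    """Return the top topic score plus `bonus` (25 percent) of that score."""
    top = max(scores.values())
    return top + bonus * top

print(inclusion_threshold(topic_scores))  # 973.75
```

A sentence that only mentions "arts" tops out at 779, so it needs a couple of the secondary topics to clear the bar.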

Let's go sentence by sentence to see what was kept and what was tossed.
Since President-elect Donald Trump has given few clues as to how he will approach arts policy, I thought I would lay out what a Trump administration ought to do. (1093)
The first sentence, as with any decently written article, meets the threshold and gets included. Yay!
I applied several standards to my recommendations. (0)
This sentence is rated a zero and gets completely tossed. No love under any circumstances. While it's nice to know what standards might've been applied, it's a big bucket of Who Cares in terms of good summarization. Let's take a look at a later sentence that was excluded.
Such a change would take the NEA back to its earliest and arguably most effective period near its origin in 1965, when it supported creators such as Alvin Ailey, Merce Cunningham, George Segal, Ed Ruscha and William Gaddis (all grant recipients in the first year alone), among other luminaries. (163)
Knowing the history of the NEA's work might be helpful, but it doesn't really cut to the core of understanding the article. Once more, superb work.

I also wanted to take a look at a sentence that approached the threshold but failed to be included.
Therefore grants were shifted to higher-level arts institutions, with the understanding that the institutions would not embarrass the federal government in this manner. (893)
What I really like with this system is that the stuff that barely doesn't make the cut is still much fattier than the stuff that barely does make the cut. We're not trying to keep the fat cap here, so this is good trimming.

Is it helpful to know the whys of how the government allocates these grants? Maybe, and you can certainly dig into the long-form article if you need that info. OTOH, if you're just looking to obtain the simplest and clearest summary of what the article is about, this is a good edit.
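For the curious, here's a rough sketch of a scoring rule that happens to line up with three of the numbers quoted above. To be clear, this is not the secret sauce. It's an assumption for illustration: each topic counts once per sentence, and a word counts toward a topic if it starts with the topic word (so "supported" hits "support", but "grant" does not hit "grants"). The names `score_sentence` and `TOPIC_SCORES` are hypothetical.

```python
import re

# Topic scores as listed above for the Cowen article.
TOPIC_SCORES = {
    "arts": 779, "administration": 138, "support": 133, "federal": 90,
    "artists": 81, "Trump": 44, "control": 36, "money": 36, "NEA": 30,
    "grants": 24,
}

def score_sentence(sentence, topic_scores):
    """Sum the score of every topic found at least once (assumed rule)."""
    words = re.findall(r"[A-Za-z]+", sentence)
    total = 0
    for topic, score in topic_scores.items():
        # Assumption: case-sensitive prefix match, each topic counted once.
        if any(word.startswith(topic) for word in words):
            total += score
    return total

near_miss = ("Therefore grants were shifted to higher-level arts institutions, "
             "with the understanding that the institutions would not embarrass "
             "the federal government in this manner.")
history = ("Such a change would take the NEA back to its earliest and arguably "
           "most effective period near its origin in 1965, when it supported "
           "creators such as Alvin Ailey, Merce Cunningham, George Segal, "
           "Ed Ruscha and William Gaddis (all grant recipients in the first "
           "year alone), among other luminaries.")
standards = "I applied several standards to my recommendations."

print(score_sentence(near_miss, TOPIC_SCORES))  # 893 (arts + federal + grants)
print(score_sentence(history, TOPIC_SCORES))    # 163 (support + NEA)
print(score_sentence(standards, TOPIC_SCORES))  # 0
```

This toy rule does not reproduce everything: it gives the opening sentence 961 (arts + administration + Trump) rather than 1093, so the real system is clearly doing something more than this.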

To say I'm happy with the bot's editorial discretion would be an understatement. I feel this is a case where it accomplishes what a human never could: cleaning out all the bullshit in the most ruthlessly efficient fashion possible. Editors have to live with writers. The bot DGAF. It's a ruthless meat-seeking editor.


Second Example



This one is an article about the awesomeness of Errol Flynn. The American Scholar: The Lightness of Errol Flynn. If you don't know who Errol Flynn is, don't worry. It just means that you were born well after black-and-white film became an artistic choice rather than a default for filmmakers.

The scoring for this article goes . . .

Flynn 520
film 140
actors 132
Errol 126
movies 100

650 is our target.
I know this sounds crazy—believe me, I know—but I just saw 19 Errol Flynn movies in a row (from Captain Blood, the 1935 film that made him instantly famous, to 1953’s The Master of Ballantrae, his last decent film and good performance before he died in 1959, only 50 years old), and I just read all three of the books he wrote, and I have read an awful lot written and said about him by other people. Some conclusions: One, Errol Flynn is one of the greatest actors I have ever seen, and I have been utterly absorbed in movies and movie actors since I was five years old when my sister took me to a movie theater for the first time (to see the new hit The Swiss Family Robinson). In all of Flynn’s best movies (for my money, in rough order, The Sea Hawk, Gentleman Jim, The Adventures of Robin Hood, The Dawn Patrol, Santa Fe Trail, Captain Blood, Edge of Darkness, Uncertain Glory, and Objective, Burma!??), Errol Flynn is, in a word, irresistible. Many fine actors can be reprised and updated (think of Gregory Peck as a later Henry Fonda), but though hints of Flynn appear in cool Steve McQueen and sprightly Tom Cruise, in suave Pierce Brosnan and bantering Will Smith, Flynn is, like Cary Grant, sui generis—a package of grace, wit, athleticism, looks, humor, and obvious joy in the work and workmates that makes going back to a film you have not seen in many years, like The Sea Hawk or Gentleman Jim, a real and lasting pleasure.
There's no getting around the slightly mad pacing of this article. Someone clearly wants to build a time machine and go back and have a relationship with Errol Flynn that would have been frowned upon at that time.

As you can see, arguments in the form of "first . . . second . . . third" get somewhat hacked up by the bot. There's probably an argument for some human editing, since "Some conclusions: One, Errol Flynn is one of the greatest actors I have ever seen" could be trimmed down into a more readable and better-flowing "Errol Flynn is one of the greatest actors I have ever seen." I'd love to crack the nut on making the machine detect and clean that up, but that level of human readability may present certain challenges. There is a point where it's hard to beat human judgment, and I think this might actually be it.
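Just to show the easy end of that problem, here's a naive sketch that strips a leading enumerator (optionally preceded by a short lead-in ending in a colon) from a sentence. It is purely illustrative: the word list, the 40-character limit on the lead-in, and the name `strip_enumerator` are all assumptions, and it does nothing for the harder cases where readability really needs human judgment.

```python
import re

# Hypothetical, deliberately small list of enumerator words.
_ENUM = r"(?:one|two|three|four|five|first|second|third|fourth|fifth)"

# Optional short lead-in ending in a colon, then an enumerator and a comma.
_PATTERN = re.compile(
    rf"^(?:[^.:!?]{{0,40}}:\s*)?{_ENUM}\s*,\s*",
    re.IGNORECASE,
)

def strip_enumerator(sentence):
    """Remove a leading 'Some conclusions: One, ' style prefix if present."""
    return _PATTERN.sub("", sentence, count=1)

print(strip_enumerator(
    "Some conclusions: One, Errol Flynn is one of the greatest actors "
    "I have ever seen"
))
# Errol Flynn is one of the greatest actors I have ever seen
```

Requiring the comma after the enumerator keeps it from mangling sentences that merely start with "One of the...".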

Where I'm most impressed by the bot is in comparing these two sentences . . .
You would follow him anywhere, not because he is handsome, not because he plays, and is, a captain or a rebel leader, but because it would be immense fun. (0)
. . . and . . .
Many fine actors can be reprised and updated (think of Gregory Peck as a later Henry Fonda), but though hints of Flynn appear in cool Steve McQueen and sprightly Tom Cruise, in suave Pierce Brosnan and bantering Will Smith, Flynn is, like Cary Grant, sui generis—a package of grace, wit, athleticism, looks, humor, and obvious joy in the work and workmates that makes going back to a film you have not seen in many years, like The Sea Hawk or Gentleman Jim, a real and lasting pleasure. (792)
The first sentence is one I would have included because I feel it summarizes the article extremely well. The bot, as you can see, rates it a big fat goose egg. It sees no value in that sentence, and instead favors a much more robust sentence with firm examples -- the sort of thing a good English professor would be proud of. I especially think the comparison of Errol Flynn to Steve McQueen and Will Smith is helpful, since those concepts will go a long way toward bringing younger film students toward an appreciation of what the original author is trying to get at.

In terms of topicality, the second sentence simply crushes. Obviously, Flynn is the main topic, the uber alles of the article. Film and actors are strong secondary themes, and mentioning those ideas alongside Flynn pushes this sentence into the include pile.
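The numbers bear this out. Here's a small worked check, assuming (and this is only an assumption, not a description of the actual scoring) that each topic word counts once per sentence no matter how many times it appears. The variable names are made up for illustration.

```python
# Topic scores as listed above for the Errol Flynn article.
flynn_scores = {"Flynn": 520, "film": 140, "actors": 132, "Errol": 126, "movies": 100}

# Target: top topic score plus 25 percent of it.
target = max(flynn_scores.values()) * 1.25

# Topic words that show up in the second sentence; "Flynn" appears twice
# but is counted once under the assumption above.
matched = ["Flynn", "film", "actors"]
sentence_score = sum(flynn_scores[topic] for topic in matched)

print(target)                    # 650.0
print(sentence_score)            # 792
print(sentence_score >= target)  # True
```

Flynn alone is worth 520, which falls short of 650. It takes the two secondary topics together to get the sentence over the line.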



Conclusion




I'm very pleased with where this project is at. Reading comprehension, and especially the ability to trim through the fluff, is one of the bigger challenges you face with any text-based machine-learning system.