The best part about writing a dissertation is finding clever ways to procrastinate. The motivation for this post comes from one of the more creative ways I’ve found to keep myself from writing. I’ve posted about data mining in the past, and this post follows up on those ideas using a topic that is relevant to anyone who has ever considered getting, or has successfully completed, a PhD.

I think a major deterrent that keeps people away from graduate school is the requirement to write a dissertation or thesis. One often hears horror stories of the excessive page lengths that are expected. However, most don’t realize that dissertations are filled with lots of white space, e.g., pages are one-sided, lines are double-spaced, and the author can put any material they want in appendices. The actual written portion may account for less than 50% of the page length. A single chapter may be 30-40 pages in length, whereas the same chapter published in the primary literature may be only 10 or so pages long. Even so, students (myself included) tend to fixate on the ‘appropriate’ page length for a dissertation, as if it’s some sort of measure of how much work you’ve done to get your degree. Any professor will tell you that page length is not a good indicator of the quality of your work. Still, I feel that some general page length goal should be established prior to writing. This length could be a minimum to ensure you put forth enough effort, or an upper limit to ensure you don’t pile on extraneous details.

It’s debatable as to what, if anything, page length indicates about the quality of one’s work. One could argue that it indicates absolutely nothing. My advisor once told me about a student in Chemistry who produced a dissertation of fewer than five pages that included nothing more than a molecular equation illustrating the primary findings of the research. I’ve heard of other advisors who strongly discourage students from creating lengthy dissertations. Like any indicator, page length provides information that may or may not be useful. However, I guarantee that almost every graduate student has thought about an appropriate page length on at least one occasion during their education.

The University of Minnesota library system has been maintaining electronic dissertations since 2007 in their Digital Conservancy website. These digital archives represent an excellent opportunity for data mining. I’ve developed a data scraper that gathers information on student dissertations, such as page length, year and month of graduation, major, and primary advisor. Unfortunately, the code will not work unless you are signed in to the University of Minnesota library system. I’ll try my best to explain what the code does so others can use it to gather data on their own. I’ll also provide some figures showing some relevant data about dissertations. Obviously, this sample is not representative of all institutions or time periods, so extrapolation may be unwise. I also won’t be providing any of the raw data, since it isn’t meant to be accessible for those outside of the University system.

I’ll first show the code to get the raw data for each author. The code returns a list with two elements for each author. The first element has the permanent and unique URL for each author’s data and the second element contains a character string with relevant data to be parsed.

#import package
require(XML)

#starting URL to search
url.in<-''

#output object
dat<-list()

#stopping criteria for search loop
stp.txt<-'2536-2536 of 2536.'
str.chk<-'foo'

#initiate search loop
while(!grepl(stp.txt,str.chk)){

  html<-htmlTreeParse(url.in,useInternalNodes=T)
  str.chk<-xpathSApply(html,'//p',xmlValue)[3]

  names.tmp<-xpathSApply(html, "//table", xmlValue)[10]
  names.tmp<-gsub("^\\s+", "",strsplit(names.tmp,'\n')[[1]])
  names.tmp<-names.tmp[nchar(names.tmp)>0]

  url.txt<-strsplit(names.tmp,', ')
  url.txt<-lapply(
    url.txt,
    function(x){

      cat(x,'\n')
      flush.console()

      #get permanent handle
      url.tmp<-gsub(' ','+',x)
      url.tmp<-paste(
        '',
        paste(url.tmp,collapse='%2C+'),
        sep=''
        )
      html.tmp<-readLines(url.tmp)
      str.tmp<-rev(html.tmp[grep('handle',html.tmp)])[1]
      str.tmp<-strsplit(str.tmp,'\"')[[1]]
      str.tmp<-str.tmp[grep('handle',str.tmp)] #permanent URL

      #parse permanent handle
      perm.tmp<-htmlTreeParse(
        paste('',str.tmp,sep=''),useInternalNodes=T
        )
      perm.tmp<-xpathSApply(perm.tmp, "//td", xmlValue)
      perm.tmp<-perm.tmp[grep('Major|pages',perm.tmp)]
      perm.tmp<-c(str.tmp,rev(perm.tmp)[1])

      }
    )

  #append data to list, will contain some duplicates
  dat<-c(dat,url.txt)

  #reinitiate url search for next iteration
  url.in<-strsplit(rev(names.tmp)[1],', ')[[1]]
  url.in<-gsub(' ','+',url.in)
  url.in<-paste(
    '',
    paste(url.in,collapse='%2C+'),
    sep=''
    )

  }

#remove duplicates
dat<-unique(dat)

The basic approach is to use functions in the XML package to import and parse raw HTML from the web pages on the Digital Conservancy. This raw HTML is then further parsed using some of the base functions in R, such as grep and strsplit. The tricky part is to find the permanent URL for each student that contains the relevant information. I used the ‘browse by author’ search page as a starting point. Each ‘browse by author’ page contains links to 21 individuals. The code first imports the HTML, finds the permanent URL for each author, reads the HTML for each permanent URL, finds the relevant data for each dissertation, then continues with the next page of 21 authors. The loop stops once all records are imported.
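The paging logic can be tried offline with a small sketch. Everything here is invented for illustration: the author list, the page size of three, and the fetch.page helper, which stands in for a real page request. It only shows the idea of looping until the status text of the last page appears, resuming from the last name seen, and dropping the duplicates that overlapping pages create.

```r
# invented author list standing in for the 'browse by author' index
all.names <- c('Adams, A', 'Baker, B', 'Clark, C', 'Davis, D', 'Evans, E')

# hypothetical stand-in for one page request: returns up to n names
# beginning at `start`, plus a status line such as '1-3 of 5.'
fetch.page <- function(start, n = 3) {
  first <- match(start, all.names)
  last <- min(first + n - 1, length(all.names))
  list(
    status = sprintf('%d-%d of %d.', first, last, length(all.names)),
    names = all.names[first:last]
  )
}

stp.txt <- '5-5 of 5.'   # status text shown on the final page
str.chk <- 'foo'         # dummy value so the loop runs at least once
start <- all.names[1]
dat <- character(0)

while (!grepl(stp.txt, str.chk, fixed = TRUE)) {
  pg <- fetch.page(start)
  str.chk <- pg$status
  dat <- c(dat, pg$names)      # overlapping pages produce duplicates
  start <- rev(pg$names)[1]    # resume from the last name on this page
}

# remove duplicates, as in the scraper
dat <- unique(dat)
```

Because each page restarts at the last name of the previous one, that name is collected twice, which is why the final unique() call matters.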

The important part is to identify the format of each URL so the code knows where to look and where to re-initiate each search. For example, each author has a permanent URL that has the basic form of the site address plus ‘handle/12345’, where the trailing digits are unique to each author (five in this example, although the number of digits varied). Once the raw HTML is read in for each page of 21 authors, the code has to find text where the word ‘handle’ appears and then save the following digits to the output object. The permanent URL for each student is then accessed and parsed. The important piece of information for each student takes the following form:

This code is found by searching the HTML for words like ‘Major’ or ‘pages’ after parsing the permanent URL by table cells (using the <td></td> tags). This chunk of text is then saved to the output object for additional parsing.
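A minimal offline sketch of these two lookups, using base R only. The HTML lines, the handle number, and the table-cell text are all invented stand-ins, so the real pages may be laid out differently.

```r
# invented HTML lines standing in for one author's results page
html.tmp <- c(
  '<a href="/browse-author">Authors</a>',
  '<a href="/handle/12345">Example dissertation title</a>'
)

# keep the last line mentioning 'handle', then split on double quotes
str.tmp <- rev(html.tmp[grep('handle', html.tmp)])[1]
str.tmp <- strsplit(str.tmp, '"')[[1]]
handle <- str.tmp[grep('handle', str.tmp)]

# the digits alone, if only the identifier is wanted
handle.id <- sub('.*/handle/', '', handle)

# invented table-cell text standing in for the parsed <td> values
cells <- c('Title', 'Example dissertation title',
           'Description', 'July 2012. Major: Example Studies. 123 pages.')

# keep the last cell that mentions 'Major' or 'pages'
record <- rev(cells[grep('Major|pages', cells)])[1]
```

Taking the last match with rev(...)[1] mirrors the scraper, on the assumption that the relevant link and cell appear after any navigation elements that also match.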

After the online data were obtained, the following code was used to identify page length, major, month of completion, year of completion, and advisor from the character string for each student. It looks messy, but it’s designed to identify the data while handling as many exceptions as I was willing to incorporate into the parsing mechanism. It’s really nothing more than repeated calls to grep using appropriate search terms to subset the character string.

#function for parsing text from website
get.txt<-function(str.in){

  #separate string by spaces
  str.in<-strsplit(gsub(',',' ',str.in,fixed=T),' ')[[1]]
  str.in<-gsub('.','',str.in,fixed=T)

  #get page number
  pages<-str.in[grep('page',str.in)[1]-1]
  if(grepl('appendices|appendix|:',pages)) pages<-NA

  #get major, exception for error
  if(class(try({
    major<-str.in[c(
      grep(':|;',str.in)[1]:(grep(':|;',str.in)[2]-1)
      )]
    major<-gsub('.','',gsub('Major|Mayor|;|:','',major),fixed=T)
    major<-paste(major[nchar(major)>0],collapse=' ')
    }))=='try-error') major<-NA

  #get year of graduation
  yrs<-seq(2006,2013)
  yr<-str.in[grep(paste(yrs,collapse='|'),str.in)[1]]
  yr<-gsub('Major|:','',yr)
  if(!length(yr)>0) yr<-NA

  #get month of graduation
  months<-c('January','February','March','April','May','June','July','August',
    'September','October','November','December')
  month<-str.in[grep(paste(months,collapse='|'),str.in)[1]]
  month<-gsub('dissertation|dissertatation|\r\n|:','',month)
  if(!length(month)>0) month<-NA

  #get advisor, exception for error
  if(class(try({
    advis<[(grep('Advis','computer',]
    advis<-paste(advis,collapse=' ')
    }))=='try-error') advis<-NA

  #output text
  c(pages,major,yr,month,advis)

  }

#get data using function, ran on 'dat'
check.pgs<-do.call('rbind',
  lapply(dat,function(x){
    cat(x[1],'\n')
    flush.console()
    c(x[1],get.txt(x[2]))})
  )

#convert to dataframe
check.pgs<-as.data.frame(check.pgs,stringsAsFactors=F)
names(check.pgs)<-c('handle','pages','major','yr','month','advis')

#reformat some vectors for analysis
check.pgs$pages<-as.numeric(as.character(check.pgs$pages))
check.pgs<-na.omit(check.pgs)
months<-c('January','February','March','April','May','June','July','August',
  'September','October','November','December')
check.pgs$month<-factor(check.pgs$month,months,months)
check.pgs$major<-tolower(check.pgs$major)

The section of the code that begins with the comment ‘#get data using function’ takes the online data (stored as dat on my machine) and applies the get.txt function to identify the relevant information. The resulting text is converted to a data frame and some minor reworkings are applied to convert some vectors to numeric or factor values. Now the data are analyzed using the check.pgs object.
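For readers who want to try this kind of parsing on their own data, here is a rough sketch on a single record. The record string is invented, and its field layout is an assumption rather than the actual Digital Conservancy format. It also uses regular-expression capture groups instead of the token-by-token subsetting in the function above; grab is a hypothetical helper.

```r
# invented record string; the real field layout is an assumption
rec <- paste('Ph.D. dissertation. July 2012. Major: Example Studies.',
             'Advisor: Jane Doe. 1 computer file (PDF); ix, 147 pages.')

# hypothetical helper: first capture group of a regex match, else NA
grab <- function(pattern, x) {
  m <- regmatches(x, regexec(pattern, x))[[1]]
  if (length(m) < 2) NA_character_ else m[2]
}

# month.name is a built-in vector of English month names
months <- paste(month.name, collapse = '|')

parsed <- c(
  pages = grab('([0-9]+) pages', rec),
  major = grab('Major: ([^.;]+)', rec),
  yr    = grab('(20[0-9][0-9])', rec),
  month = grab(paste0('(', months, ')'), rec),
  advis = grab('Advisor: ([^.;]+)', rec)
)

# page count as a number, ready for summaries
n.pages <- as.numeric(parsed[['pages']])
```

Returning NA when a pattern is missing plays the same role as the try() wrappers in the original function: one malformed record does not stop the whole run.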

The data contained 2,536 records for students that completed their dissertations since 2007. The range was incredibly variable (minimum of 21 pages, maximum of 2002), but most dissertations were around 100 to 200 pages.

Interestingly, a lot of students graduated in August just prior to the fall semester. As expected, spikes in defense dates were also observed in December and May at the ends of the fall and spring semesters.

The top four majors with the most dissertations on record were (in descending order) educational policy and administration, electrical engineering, educational psychology, and psychology.

I’ve selected the top fifty majors with the highest number of dissertations and created boxplots to show relative distributions. Not many differences are observed among the majors, although some exceptions are apparent. Economics, mathematics, and biostatistics had the lowest median page lengths, whereas anthropology, history, and political science had the highest median page lengths. This distinction makes sense given the nature of the disciplines.

I’ve also completed a count of number of students per advisor. The maximum number of students that completed their dissertations for a single advisor since 2007 was eight. Anyhow, I’ve satiated my curiosity on this topic so it’s probably best that I actually work on my own dissertation rather than continue blogging. For those interested, the below code was used to create the plots.

######
#plot summary of data
require(ggplot2)

mean.val<-round(mean(check.pgs$pages))
med.val<-median(check.pgs$pages)
sd.val<-round(sd(check.pgs$pages))
rang.val<-range(check.pgs$pages)
txt.val<-paste('mean = ',mean.val,'\nmed = ',med.val,'\nsd = ',sd.val,
  '\nmax = ',rang.val[2],'\nmin = ', rang.val[1],sep='')

#histogram for all
hist.dat<-ggplot(check.pgs,aes(x=pages))
pdf('C:/Users/Marcus/Desktop/hist_all.pdf',width=7,height=5)
hist.dat + geom_histogram(aes(fill=..count..),binwidth=10) +
  scale_fill_gradient("Count", low = "blue", high = "green") +
  xlim(0, 500) +
  geom_text(aes(x=400,y=100,label=txt.val))
dev.off() #close the pdf device so the file is written

#barplot by month
month.bar<-ggplot(check.pgs,aes(x=month,fill=..count..))
pdf('C:/Users/Marcus/Desktop/month_bar.pdf',width=10,height=5.5)
month.bar + geom_bar() +
  scale_fill_gradient("Count", low = "blue", high = "green")
dev.off()

######
#histogram by most popular majors
#sort by number of dissertations by major
get.grps<-list(c(1:4),c(5:8))#,c(9:12),c(13:16))

for(val in 1:length(get.grps)){

  pop.maj<-names(sort(table(check.pgs$major),decreasing=T)[get.grps[[val]]])
  pop.maj<-check.pgs[check.pgs$major %in% pop.maj,]
  pop.med<-aggregate(pop.maj$pages,list(pop.maj$major),function(x) round(median(x)))
  pop.n<-aggregate(pop.maj$pages,list(pop.maj$major),length)

  hist.maj<-ggplot(pop.maj, aes(x=pages))
  hist.maj<-hist.maj + geom_histogram(aes(fill = ..count..), binwidth=10)
  hist.maj<-hist.maj + facet_wrap(~major,nrow=2,ncol=2) + xlim(0, 500) +
    scale_fill_gradient("Count", low = "blue", high = "green")

  y.txt<-mean(ggplot_build(hist.maj)$panel$ranges[[1]]$y.range)
  txt.dat<-data.frame(
    x=rep(450,4),
    y=rep(y.txt,4),
    major=pop.med$Group.1,
    lab=paste('med =',pop.med$x,'\nn =',pop.n$x,sep=' ')
    )

  hist.maj<-hist.maj + geom_text(data=txt.dat, aes(x=x,y=y,label=lab))

  out.name<-paste('C:/Users/Marcus/Desktop/group_hist',val,'.pdf',sep='')
  pdf(out.name,width=9,height=7)
  print(hist.maj)
  dev.off()

  }

######
#boxplots of data for fifty most popular majors
pop.maj<-names(sort(table(check.pgs$major),decreasing=T)[1:50])
pop.maj<-check.pgs[check.pgs$major %in% pop.maj,]

pdf('C:/Users/Marcus/Desktop/pop_box.pdf',width=11,height=9)
box.maj<-ggplot(pop.maj, aes(factor(major), pages, fill=pop.maj$major))
box.maj<-box.maj + geom_boxplot(lwd=0.5) + ylim(0,500) + coord_flip()
box.maj + theme(legend.position = "none", axis.title.y=element_blank())
dev.off()

Update: By popular request, I’ve redone the boxplot summary with major sorted by median page length.
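One way to get that ordering, sketched here on invented toy data rather than the dissertation records, is to reorder the factor levels by median before plotting. The majors 'a', 'b', 'c' and their page counts are placeholders, and this is not necessarily how the updated figure was produced.

```r
# invented toy data; majors and page counts are illustrative only
toy <- data.frame(
  major = rep(c('a', 'b', 'c'), each = 3),
  pages = c(200, 210, 220, 90, 100, 110, 150, 160, 170)
)

# reorder() sorts factor levels by a summary of a second variable
toy$major <- reorder(factor(toy$major), toy$pages, FUN = median)

# levels now run from lowest to highest median page length
ord <- levels(toy$major)
```

Since ggplot2 draws discrete axes in factor-level order, mapping the reordered major column to the x aesthetic should give boxplots sorted by median.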

Writing 25,000 words in 8 days. Possible?

posted about 9 years ago

So, I have to have 25000 words, which I'm guessing will be the size of this chapter, written by 3 June. I have all the research done, and I have the structure of the chapter, and I'm basically just going to 'search' on my computer when I get to each section, key in the subject/topic, look at all my research on that area and then write from that. I've just written a princely 154 words so far in 20 minutes. :-)

It is more a creative exercise than a research one, about presenting my research well. I'm talking about writing 3000 words per day each day until next Wednesday night. Is it possible? How many words of PhD standard is the most you've written in a day or week?

I'm sort of buzzing now as I look at all my info and the only problem is what to leave out. But having a deadline will definitely help me, I think.

I'll keep a little daily diary here if I think of it!

posted about 9 years ago

Hmm....that's a tall order and whether it is possible depends on how quickly you can assimilate your research notes. I tend to research and write together so I can't really say if it's possible to just write-up 3000 words a day, but I'm pretty sure it is. I do remember writing my Master's thesis (obviously not PhD quality) in 2/3 days from rough research notes. You'll simply have to glue your bum to your desk chair, and switch off all distractions.

You'll probably make lots of mistakes, so set an hour aside each day to re-read what you've written.

I'm actually attempting a similar feat of 15,000 words in two weeks, but with a lot of research unfinished.

Edit: just to add, why does it have to be 25,000 words?

posted about 9 years ago

That is a lot, although I once did 2500 words in a day which isn't far off 3000 - don't know if I could do that for eight days straight though! I think if you even had the bones of your chapter done in the 8 days that would be great. Is there some reason you have to hit the full count by June 2nd? If not, then just focus on getting the chapter done; you can try to add more words later!

posted about 9 years ago

That does sound like a lot. I have managed about 1500 a day before, but I was clear about what I was going to say. It can probably be done, working very concentrated and long hours.. but that kind of regime for eight days? No, not impossible, but it sounds extremely hard!!

posted about 9 years ago

Personally, I can't do more than about 1000 words per day of PhD quality.  I know others that can do a lot more, so I suppose it could be possible depending on the individual. I always find that other things get in the way, like supervisor meetings and Cash In The Attic/Homes Under The Hammer, etc.  Best wishes though!

posted about 9 years ago

2000 a day for a couple of weeks I'd say is perfectly achievable, if exhausting, if you're just writing up notes. To push it up to 3000 I think you need to separate your working day (which is obviously going to a pretty long one for the next 8 days!) into three sections of 1000 words, making sure you take proper breaks and get as much sleep as possible each night. Good luck - I don't envy you!!

posted about 9 years ago

Wow, that is quite some feat! If I'm on a roll I can normally manage around 2-3K a day. Having said that, I wrote 1.5K yesterday and was worn out, and everything I tried to write after that was utter rubbish - so I'd agree, you'd have to split it up and make sure you get good breaks in. I tend to splurge it all out then spend another week going back over it to edit and turn it into English lol. I wrote the whole first draft for my MA dissertation in 8 days and that was 20K words so it is possible, but it's exhausting. Hope you get it done ok :-) I thought I was hard done by having to get my board paper written and submitted by the 15th at around 10K words lol

posted about 9 years ago

Aodhán - from your other posting it looks like you have to do this for your thesis resubmission. Of course you can do it! You have very clear goals set out. Write exactly to those goals and don't change anything else!!!

posted about 9 years ago

Oh! and cancel EVERYTHING for 8 days. literally. Then, on the 3rd of June, hand it in and go to the nearest pub/cafe/shop and get a massive glass of wine/disgustingly chocolatey cake/crazy beautiful thing you don't need (or preferably all). Then go home and sleep for a couple of days. After that, the next phase of your life starts!

posted about 9 years ago

nice advice from A116!!! Sometimes I do approximately 2,000 words per day but again it depends on what I'm writing about.
Sometimes I become depressed and don't write anything for days.
Then miraculously I come out of this no-writing-no-working phase and then get back to it.
You'll be ok.
cheers satchi

by CarriKP

posted about 7 years ago

Did you manage to reach your daily goal? I'm interested because I'm trying to finish a Master's diss with a fairly close deadline 8-) - I also have done the research.

posted about 7 years ago

I remember reading this book (wish I could remember what the title was and the author... it was a book on PhD writing or research writing anyway) and the author was saying how she once rented a log cabin and basically got everything written in 10 days cos she was just in the log cabin with no distractions whatsoever. Now that's great if you could afford to go off and rent a log cabin and you weren't worried about Jason or Leatherface coming after you, but I think you can borrow her logic: try and find a setting in which you won't have distractions, tell your friends and family you've got to get this done so no distractions for 5-7 days, and just sit there and get it done. Still sounds like a huge amount to try and get done, esp. if you're stressed, but definitely a quiet environment, adopting a 9-5 approach and just sitting there and saying 'this has to be done' helps!

