Hierarchical Multiscale Recurrent Neural Networks for Detecting Suicide Notes

Recent statistics in suicide prevention show that people are increasingly posting their last words online, and with the unprecedented availability of textual data from social media platforms, researchers have the opportunity to analyse such data. Furthermore, psychological studies have shown that our state of mind can manifest itself in the linguistic features we use to communicate. In this paper, we investigate whether it is possible to automatically distinguish suicide notes from other types of social media posts in two document-level classification tasks. The first task aims to distinguish suicide notes from depressed posts and blog posts in a balanced dataset, whilst the second looks at how well suicide notes can be classified when there is a vast amount of neutral text data, which makes the task more applicable to real-world scenarios. Furthermore, we perform a linguistic analysis using LIWC (Linguistic Inquiry and Word Count). We present a learning model for modelling long sequences in two experiment series. We achieve an F1-score of 88.26% over the baseline of 0.60 in experiment 1 and 96.1% over the baseline in experiment 2. Finally, we show through visualisations which features the learning model identifies; these include emotions such as love and personal pronouns.


INTRODUCTION
Whilst both machine and deep learning techniques have been predominantly used for commercial purposes, there has also been an increased awareness of how AI approaches could contribute to solving some of the biggest social problems humans face worldwide [1]. This awareness has led to the creation of new workshops and conferences that fall under the umbrella of AI for Social Good, where machine learning researchers connect with Non-Governmental Organisations (NGOs), charities and other problem owners to create practical solutions. These problems and challenges are usually closely linked to accelerating progress towards the Sustainable Development Goals (SDGs) produced by the United Nations (UN) [2]. These goals include, but are not limited to, protecting democracy, education, social welfare and justice as well as health care and environmental sustainability.
Especially within the SDG for health care, there is an increased focus on mental health. In a recent report, the World Health Organisation [2] outlines that suicide is the second leading cause of death for people aged 15-29 worldwide. Reducing the rate of suicide worldwide has therefore been listed as one of the objectives of the Sustainable Development Goals for health care. It is estimated that around 25-30% of people who died by suicide leave behind a suicide note; however, this figure can be as high as 50% depending on cultural or ethnic differences in demographics [3]. [4] have found that there is an increasing trend amongst younger people to publish their suicide notes or express their suicidal feelings online. Furthermore, psychological studies have shown that our state of mind can manifest itself in the linguistic features we use to communicate [5], [6]. At the same time, the use of social media platforms, such as blogging websites, has become part of everyday life and there is increasing evidence that social media can influence both suicide-related behaviour and other mental health conditions. Whilst there are efforts to tackle suicide and other mental health conditions online by social media platforms such as Facebook [7], there are still concerns that there is not enough support and protection, especially for younger users [8].
Taking these trends into account, and with this unprecedented availability of textual data from social media platforms, researchers now have the opportunity to analyse such data and use their findings in several different application areas. This has led to a notable increase in research on suicidal and depressed language usage [9], [10] and subsequently triggered the development of new healthcare applications and methodologies that aid the detection of concerning posts on social media platforms [11]. Traditionally, work on suicide notes has focused on distinguishing genuine from forged suicide notes in the field of forensic linguistics, where the findings were used as additional evidence in legal proceedings [12]. However, in recent years and with the advances in machine and deep learning, there has been an increasing amount of research conducted to identify suicidal ideation or suicide notes in online settings, such as social media platforms [13], [14].
In this paper, we first explore existing research and literature in the field of suicide note detection in section 2. We then analyse the linguistic features of the different datasets used in section 3. In section 4 we introduce the learning model, a dilated LSTM with attention. Next, section 5 presents a series of two experiments using two different kinds of datasets and a variety of recurrent neural networks. For the first experiment series we use a balanced dataset to classify suicide notes, depressed posts and blog posts to see how hard the task proves in this setting. The second experiment aims to make the task more applicable to the real world: the numbers of both depressed posts and blog posts are increased to reflect the rarity of genuine suicide notes on social media platforms. In section 6 we discuss the experimental results and evaluate the visualisations.

RELATED WORK
Over the years there has been much research conducted into the accurate classification of suicide notes or detection of suicidal ideation online [13], [14], where researchers use several different methodologies including, but not limited to, traditional machine learning [15], deep learning [16] and sentiment analysis [17]. Such research has been conducted in a range of different disciplines such as psychology [18], linguistics [13] or healthcare [19]. Many experiments have also been conducted comparing suicide notes with different types of textual data, such as depressed language or blog posts [20]. Overall there has been a growing interest in looking at content created online that may solicit need for help [21] or in detecting mental health issues [22]. This literature review will focus on introducing work on the classification of suicide notes and suicidal ideation detection, but will also review work in the space of depressed language and last statements due to the nature of the experiments.

Suicide note classification
The analysis of suicide notes has been used in various academic settings such as psychology or forensic linguistics, either to identify the genuineness of a suicide note or to predict the state of mind of a note writer [12]. It has been argued in previous research that our drive or motivation affects how we communicate, and it is therefore believed that our spoken and written language represents those shifting psychological states [23]. This argument has been taken further by [6], who suggested that there is a shift in one's linguistic expression due to the aroused cognitive state suicidal individuals experience. These findings have led to [4]'s argument that there is an increased need for 'automatic procedures that can spot suicidal messages and allow stakeholders to quickly react to online suicidal behaviour or incitement'. Recent research has therefore looked at different aspects of suicide notes to find out what "makes" a suicide note, where linguistic features and patterns, affective states or specific emotions, and dominant topics have been used in different analyses and experiments. [24] provide an overview of applications, methods and domains in suicide note research.
One of the settings in which the validation of a suicide note is important is in court cases or hearings, where expert evidence is given by professionals such as forensic linguists to verify the author of the note or its genuineness [12]. Another field where the analysis of suicide notes is crucial is psychology, where one of the most commonly cited studies was conducted by [25]. In their study, they collected a corpus of 33 genuine suicide notes and another set of 33 suicide notes that were forged. Their analysis showed that there was a clear difference in the language used, which made the genuine notes distinctive when compared to the forged notes. This study has been used as a foundation for many other studies afterwards [26], and researchers such as [5] have compared this set of suicide notes with a set of normal letters to friends. Whilst early work in linguistics and psychology in particular has mainly focused on the distinguishing factors of linguistics and topics [27], the availability of such data to researchers from other disciplines has opened up opportunities to use traditional machine learning and feature engineering for classifying suicide notes. [28] have used a supervised classification model and a set of linguistic features to distinguish genuine from forged suicide notes, achieving an accuracy of 82%. Studies using traditional machine learning have been taken further recently by [29], who also used a set of suicide notes and correctly hypothesised that a machine learning algorithm applied to this set would outperform mental health professionals in classifying suicide notes correctly.
Detecting affective states or emotions in such data has also grown in popularity. The work of [10] has been particularly influential in the field; in their study they found that there are fifteen different emotional concepts which prove to be significant in identifying genuine suicide notes. These fifteen sentiment features have also been used by [30] in the i2b2/VA/Cincinnati Medical Natural Language Processing Challenge. The challenge aimed to develop a model which could automatically identify emotions at the sentence level of a suicide note. The hybrid model developed by [30] achieved an accuracy of 61.39% in detecting emotions using various techniques such as machine learning-based emotion classification. [30] argue that one of the key factors for successful identification of emotions is to split the 15 pre-specified emotions into three different classes (positive, negative and neutral). [31] have focused on combining both sentiment and linguistic features, which led to a test accuracy of 86.6%. [32] have used four different feature groups, including sentiment, to assess suicide risk using a hybrid model.
Suicide note research has not only focused on the sentiment conveyed in notes but also on linguistic [5] and content [33] features. Research conducted by [28] used Receiver Operating Characteristic (ROC) analysis to distinguish genuine and forged suicide notes from each other, yielding an average AUC of 0.82. Other work conducted by [31] has found that a combination of both linguistic and sentiment features achieves an accuracy of 86.61% when using a logistic model tree (LMT).

Suicide ideation classification
Recent years have seen an increase in the analysis of suicidal ideation on social media platforms, such as Twitter. [14] searched the Twitter API for specific keywords and analysed the data using both traditional machine learning techniques and neural networks, achieving an accuracy of 97.6% using neural networks. Research conducted by [34] has developed a classifier to distinguish suicide-related themes such as reports of suicides and casual references to suicide. The increased use of deep learning in other areas of Natural Language Processing [35] has also led to more studies using Recurrent Neural Networks (RNN) or Convolutional Neural Networks (CNN) to detect suicide notes or suicidal ideation [36]. Work by [37] used multiple neural network architectures to detect suicidal ideation. Research by [38] uses multitask learning to estimate the risk of suicide using multiple public datasets from various shared tasks. Work by [39] has looked at identifying suicidal ideation on Twitter by using lexical, structural and sentiment features with traditional machine learning, and achieved an F-measure of 0.728.

Depression notes
Work on identifying depression and other mental health conditions has become more prevalent over recent years, where a shared task was dedicated to distinguishing depression and Post Traumatic Stress Disorder (PTSD) on Twitter using machine learning [9]. [40] have argued that changes in the cognition of people with depression can lead to different language usage, which manifests itself in the use of specific linguistic features. Research conducted by [41] also used linguistic signals to detect depression with different topic modelling techniques. Work by [42] used the Linguistic Inquiry and Word Count software (LIWC) to analyse written documents by students who have experienced depression, currently depressed students and students who have never experienced depression, where it was found that individuals who have experienced depression used more first-person singular pronouns and negative emotion words. [43] used LIWC to detect differences in language in online depression communities, where it was found that negative emotion words are good predictors of depressed text compared to control groups using a Lasso model [44]. Research conducted by [45] showed that using LIWC to identify sadness and fatigue helped to accurately classify depression. [46] use Convolutional Neural Networks to model the relationship between depression and people who attempt suicide. Some work has focused on detecting mental health signals related to other conditions such as bipolar disorder, major depressive disorder, post-traumatic stress disorder and seasonal affective disorder [47]. In their work, [48] have looked extensively at which features are relevant when classifying depression in tweets.

Social Media blogs
Work on classifying blogs from social media platforms has focused on predicting sentiment or emotions [49] or characteristics of the author of a blog, such as age [50] or gender [51]. Other work has focused on modelling ideologies in blogs using topic modelling techniques [52].

DATA
This section provides an overview of the different datasets used as well as where and how they have been collected. All corpora have been anonymised in order to protect the identity of the authors and of those mentioned in their communication, which includes any places, names or references to identifying information. The examples of notes below have been chosen for their brevity; many of the notes in the corpus are of greater length. Previous work in this area has predominantly focused on distinguishing suicide notes from other types of notes that are in a distinct category, e.g. depression or love notes [31]. However, distinguishing suicide notes from 'neutral' blog posts is harder, because blog posts usually do not fall into neat categories.
Therefore we have chosen a random sample of blog posts to make the task more applicable to real-world scenarios. Furthermore, classifying suicide notes in such a setup could help to identify further distinguishing features in the language used in these notes. Below we outline the different datasets used in the subsequent experiments; examples of the notes can be seen in Figures 1, 2 and 3:

Genuine Suicide Note Data
Genuine suicide notes provide a unique insight into the mindset of a person who has died by suicide [53]. Therefore we have chosen to only use genuine suicide notes in our experiments and made a conscious decision not to use other datasets such as Twitter suicide datasets [39]. The main reason is that these tweets have mainly been collected using specific keywords such as 'suicide', and there is no human verification that the person who wrote a given tweet is indeed suicidal or has passed away. Due to the sparsity of genuine suicide notes that are publicly available, we have added new genuine suicide notes to the corpus provided by [20]. Other new additions to this corpus include data from various sources (for a full list, see Appendix A). There is a total of 211 genuine suicide notes (hereafter GSN, see Figure 1) used in these experiments.

Depression Notes
We used the Reddit depression data provided by [54] to create two different datasets for the two experiments. The first dataset consists of 211 depressed notes (hereafter referred to as DL1) and the second dataset includes 1293 depressed notes (hereafter referred to as DL2, see Figure 2).

Neutral Blog Posts
We have chosen a random sample of online blog posts, collected by [55], as our neutral category. For the two experiments we used 211 blog posts (hereafter referred to as NEU1) and 3500 blog posts (hereafter referred to as NEU2), respectively. We have chosen the number of blog posts in NEU2 empirically to ensure that GSN notes make up less than 5% of the overall dataset, which makes the task more applicable to the real world. An example of a neutral blog post can be seen in Figure 3.
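The class proportions implied by the dataset sizes above can be checked with a few lines of arithmetic. The sketch below uses only the counts reported in this section; the dictionary and function names are illustrative and not taken from the authors' code.

```python
# Dataset sizes as reported in the text (number of documents per corpus).
SIZES_EXP1 = {"GSN": 211, "DL1": 211, "NEU1": 211}
SIZES_EXP2 = {"GSN": 211, "DL2": 1293, "NEU2": 3500}


def class_shares(sizes):
    """Return each class's share of the whole dataset as a fraction."""
    total = sum(sizes.values())
    return {name: count / total for name, count in sizes.items()}


shares_exp1 = class_shares(SIZES_EXP1)
shares_exp2 = class_shares(SIZES_EXP2)

for name, share in shares_exp2.items():
    print(f"{name}: {share:.1%}")
```

With these counts, genuine suicide notes make up one third of the balanced dataset and roughly 4.2% of the larger one, which is consistent with the stated aim of keeping GSN below 5%.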

Linguistic Analysis
To gain more insight into the content of the datasets, we performed a linguistic analysis to show differences in the structure and contents of the datasets. For this study, we used the Linguistic Inquiry and Word Count software (LIWC) [56], which has been developed to analyse textual data for psychological meaning in words. We report the average of all results across each dataset. LIWC has been used in previous research, alongside expert annotators, to annotate datasets for suicide risk and to determine linguistic profiles of suicide-related Twitter posts [57]. Other work by [42] used LIWC to analyse written documents by students who have experienced depression, currently depressed students and students who have never experienced depression.

Dimension Analysis
Firstly, we looked at the word count and different dimensions of each dataset (see Table 1). It has previously been argued by [56] that the words people use can give insight into the emotions, thoughts and motivations of a person, where LIWC dimensions correlate with emotions as well as social relationships. The number of words per sentence is highest in DL notes and lowest in GSN notes. Research by [5] has suggested that people in stressful situations break their communication down into shorter units. This may indicate elevated stress levels in individuals writing notes before taking their own life. Clout stands for the social status or confidence expressed in a person's use of language [58]. This dimension is highest for people writing blog posts, whereas depressed people rank lowest on this. [59] have noted that this might be because depressed individuals often have a lower socioeconomic status. The Tone of a note refers to the emotional tone, including both positive and negative emotions, where numbers below 50 indicate a more negative emotional tone [60]. The tone for GSN is highest overall and lowest in DL, indicating a more negative tone overall in DL and a more positive tone in GSN. It was also found by [57] that the most concerning suicide-related content found on Twitter had a higher word count; here the highest word count is in blog posts, whilst GSN notes have the lowest word count. This could be due to the fact that there is no character restriction placed upon bloggers. Sixltr in Table 1 refers to words that are longer than 6 letters, which are meant to indicate the social class and level of education of a person. The lowest scores were observed for GSN notes and the highest for blog post writers. Whilst there is no additional information available to evaluate either educational or socio-economic factors, it could be argued that the lack of longer words in GSN notes is due to the argument made by [5] that GSN writers break communication down into shorter units due to elevated stress levels, and not due to their educational or socio-economic background.
The Analytical thinking dimension indicates to which extent people use "formal, logical, and hierarchical thinking patterns" [56]. NEU writers score highest in this category and GSN writers score lowest. It has been found that people who score low in analytical thinking tend to write and speak more narratively and focus on the present as well as personal experiences, compared to people who score highly in this dimension [58]. The term Authenticity refers to the extent to which people write about themselves in an honest way, where they are typically portrayed as more humble, personal and vulnerable [61]. DL notes were the most authentically written, whilst the least authentic texts were written by NEU writers. Arguably, blog posts do not require a writer to be vulnerable, and with the increasing use of blogging as a marketing tool there may be less personal or humble language found in these posts.
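To make the surface-level dimensions above more concrete, the following sketch computes a rough word count, words per sentence and the share of words longer than six letters for a piece of text. It is only an approximation for illustration: LIWC's own tokenisation and dictionaries are not reproduced here, and the function name, the regular expressions and the example sentence are assumptions rather than part of the original study.

```python
import re


def surface_features(text):
    """Rough word count, words per sentence and percentage of words over six letters."""
    # Naive sentence split on terminal punctuation (an assumption, not LIWC's rule).
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    # Naive word tokenisation: runs of letters and apostrophes.
    words = re.findall(r"[A-Za-z']+", text)
    n_words = len(words)
    long_words = [w for w in words if len(w.replace("'", "")) > 6]
    return {
        "word_count": n_words,
        "words_per_sentence": n_words / len(sentences) if sentences else 0.0,
        "sixltr_pct": 100.0 * len(long_words) / n_words if n_words else 0.0,
    }


example = surface_features("We went hiking yesterday. The mountains looked wonderful.")
print(example)
```

Averaging such per-document values over a corpus would give figures comparable in spirit, though not in exact value, to the per-dataset averages reported in Table 1.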

Function Words and Content Words
The next section looks at selected function words and grammatical differences, which can be split into two categories called Function Words (see Table 2) and Content Words. Function words refer to a variety of different word categories, such as pronouns or auxiliary verbs, and make up the majority of all words that a person uses [56]. It was found that there is a difference in how human brains process function and content words [62]. Research has also found that function words are connected with indicators of people's social and psychological worlds [56], where it has been argued that the use of function words requires basic skills. The highest number of function words was used in DL notes, whilst blog posts have the lowest number of function words. [42] have found that high usage, specifically of first-person singular pronouns ("I"), could indicate higher emotional and/or physical pain, as the focus of the writers' attention is towards themselves. Overall, [57] have also identified a larger number of personal pronouns in suicide-related social media content. This may be the reason why GSN notes are high in personal pronouns overall and in the first-person singular, whilst blog posts are lowest in both categories. Previous work by [63] has found that people use more negations and fewer words overall when expressing negative emotions, compared to more positive emotions. This also seems to be true here, where the number of Negations is highest in the DL corpus and lowest in the blog corpus, whilst negative emotions are also highest in DL notes. Furthermore, it was found that Verbs, Adverbs and Adjectives are often used to communicate content; however, previous studies [28], [53] have found that individuals who die by suicide are under a higher drive and would therefore reference a higher number of objects (through nouns) rather than using descriptive language such as adjectives and adverbs. This may explain why the number of adjectives and adverbs is lowest in GSN notes and highest in DL notes.

Affect Analysis
The analysis of emotions in suicide notes and last statements has often been addressed in research [64], [65]. Table 3 shows sentiments and emotions that were detected for the datasets using LIWC. Overall, the highest number of affect words is in DL notes, whilst the lowest is in blog posts. This may also relate back to the level of authenticity usually found in DL notes and lacking in blog posts, due to blog post writers not being as vulnerable. Negative emotions are also highest in DL notes and lowest in blog posts; as before, this may refer back to the Tone used in those types of corpora, where a higher number of negative emotions is often correlated with the tone used in a note. Positive emotions are highest in GSN notes, whilst they are lowest in DL notes. This has been found previously by [31], who found that emotions such as 'love' occur more frequently in GSN notes than in other corpora. Blog posts display the lowest number of emotions such as anger, sadness or anxiety, whilst this is highest in the DL corpus. It has been shown in prior research that these emotions are more prevalent in DL note writers, as these are typical feelings expressed when people suffer from depression [31].

Social and Psychological Processes
Social Processes highlights the social relationships of note writers, where it can be seen in Table 4 that the highest number of social processes can be found in GSN and the lowest in DL. Furthermore, GSN notes tend to speak most about family relations and least about friends.
[66] have found that people use more cognitive mechanisms to cope with traumatic events such as break-ups, using more causal words to organise and explain events and thoughts for themselves. Insight encompasses words such as think or consider, whilst Cause encompasses words that express reasoning or causation of events, e.g. because or hence. These terms have previously been coined as cognitive process words by [53], who argued that these words are used less in GSN notes, as the writer has already finished the decision-making process, whilst other types of discourse would still try to justify and reason over events and choices. Similar results can be found in our own data (see Table 5), where both Insight and Cause are low in GSN notes, but high in DL notes. Tentativeness refers to language use that indicates a person is uncertain about a topic and uses a number of filler words. It has been argued that participants who use more tentative words may not have expressed an event to another person and therefore have not yet processed the event and formed it into a story [56]. The number of tentative words used is highest in DL notes, whilst it is lowest in GSN notes. This might be due to the fact that GSN writers have already had to go over certain events multiple times and have made their decision [53].

Personal Concerns
Personal Concerns refers to the topics most commonly brought up in the different notes (see Table 6). Work is most often referred to in NEU notes and least often in GSN notes, which could be due to blogging often being used for marketing and advertising [67]. Money is most often referenced in GSN notes and least often in DL notes; this might be because [68] lists these two topics as some of the most common reasons for a person to die by suicide. Religion is most commonly referenced in GSN notes and least commonly in DL notes. [57] have found that the topic of Death is commonly referenced in suicide-related communication on Twitter. This was also found in this dataset, where GSN notes most commonly referenced death, whilst DL notes were least likely to reference this topic. Furthermore, references to Leisure are highest in the NEU corpus and lowest in GSN notes. References to Home were highest in GSN notes and lowest in NEU notes, which might be due to GSN writers often leaving instructions behind [29], which could reference places within a house.

Time Orientation and Relativity
Looking at the Time Orientation of a note can give interesting insight into the temporal focus of attention, and differences in verb tenses can show psychological distance or the extent to which disclosed events have been processed [56]. Table 7 shows that the focus of DL notes is primarily on the past, whilst GSN and NEU notes focus on the future. The high focus on the past in DL notes could be because their writers draw on past experiences to express the issues of their current situation or problems. The most frequent use of future tense in GSN notes could be due to the writer leaving behind instructions for others [10]. Relativity refers to references to space, motion and time in a note.

Cohen's d effect size
Cohen's d effect size was used to calculate the pairwise importance [69] of each feature. An effect over d=0.2 (highlighted blue) indicates a small effect, d=0.5 (highlighted green) indicates a medium effect and d=0.8 (highlighted yellow) shows a large effect. Furthermore, [69] argued that an effect size of d=0.5 or higher should be easily seen by humans in real-world examples. It can be seen in Table 8 that most features have a small effect (36.48%), whereas medium and large effects make up 22.97% and 6.08% of the features respectively and should be clearly visible when examining any posts or notes. Furthermore, it can be seen that categories such as Dimensional Analysis, Affect or Function words show a medium to large effect size across their subcategories, whereas Cognitive Processes seem to only have a small to medium effect size for the GSN to DL pairwise comparison. Also, it can be seen that features such as Words per sentence, Adjectives or Home do not have any effect on any of the datasets. Other features such as Clout, Tone, Anxiety, Anger, Insight, Tentativeness and Focus past do not appear to be important when measuring statistical significance between GSN3 and NEU1 / NEU2 posts. In comparison, there is only one feature (Leisure) that is not statistically significant when comparing GSN3 to DL2 / DL3 notes. When comparing GSN3 to DL2 / DL3 notes, the Affect category seems to be most important, whereas for a comparison of GSN3 to NEU1 / NEU2 the Function word category is most significant. Therefore, it could be argued that in future work, a more fine-grained analysis of sentiment would provide more insight and distinct features to accurately classify suicide notes from depressed notes. On the other hand, for a comparison of suicide notes to 'neutral' posts, a focus on function words seems most appropriate.
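As a minimal sketch of how such a pairwise effect size can be computed for one feature, the code below implements Cohen's d using the pooled sample standard deviation and maps it onto the thresholds given above. The pooled-variance form is an assumption, since the text does not state which variant was used; the function names and the toy values in the example are illustrative only.

```python
from math import sqrt
from statistics import mean, variance


def cohens_d(a, b):
    """Cohen's d for two samples, using the pooled sample standard deviation."""
    n_a, n_b = len(a), len(b)
    pooled_var = ((n_a - 1) * variance(a) + (n_b - 1) * variance(b)) / (n_a + n_b - 2)
    return (mean(a) - mean(b)) / sqrt(pooled_var)


def effect_label(d):
    """Map |d| onto the thresholds used in the text (0.2, 0.5, 0.8)."""
    size = abs(d)
    if size >= 0.8:
        return "large"
    if size >= 0.5:
        return "medium"
    if size >= 0.2:
        return "small"
    return "negligible"


# Toy per-document scores for one feature in two corpora (made-up numbers).
d = cohens_d([2.0, 4.0, 6.0], [1.0, 3.0, 5.0])
print(d, effect_label(d))
```

In the setting described here, `a` and `b` would hold the per-document LIWC scores of one feature for two corpora, for example GSN and DL.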
Overall, the category that seems most important across all three datasets is Function words, where only one feature (Negations) is not statistically significant when comparing GSN3 to DL2.

LEARNING ALGORITHM
Recurrent neural networks (RNNs) are well suited to natural language processing tasks due to their ability to handle sequential data [70]; however, there are still shortcomings which ultimately affect the accurate classification of longer sequences. This is mainly due to the problem of vanishing and exploding gradients [71], which impacts the RNN's ability to maintain mid- and short-term memory when memorising long-term dependencies. Various approaches have tried to solve the problem of learning long-term dependencies in temporal data, where variations of multiscale RNNs have produced state-of-the-art results on various tasks. Generally speaking, multiscale RNNs group the hidden units of the network into multiple modules that operate on different timescales [72], [73], [74] in order to overcome this problem.
For our implementation of a Dilated LSTM, we follow the implementation of recurrent skip connections with exponentially increasing dilations in a multi-layered learning model by [75].This allows LSTMs to better learn input sequences and their dependencies and therefore temporal and complex data dependencies are learned on different layers (see Figure 4).

Dilated LSTM with ranked units
Each document $D$ contains $i$ sentences $S_i$, where $w_i$ represents the words in each sentence. Firstly, we embed the words to vectors through an embedding matrix $W_e$, which is then used as input into the dilated LSTM.
The most important part of the dilated LSTM is the dilated recurrent skip connection, where $\mathrm{LSTM}^{(l)}_t$ is the cell in layer $l$ at time $t$, $x^{(l)}_t$ is the input to layer $l$ at time $t$, and $M$ and $L$ denote the dilations at different layers.
We extend the standard dilated LSTM in two ways for our experiments. The standard dilated LSTM alleviates the problem of learning long sequences, but not every document has the same sequence length, so in order to overcome this variability we provide fixed boundaries to each layer by reducing the number of hidden units per sub-LSTM hierarchically. Larger sub-LSTMs therefore focus on learning long-term dependencies, whilst smaller sub-LSTMs focus on more frequently occurring short-term dependencies. This leads to improved performance, as has been shown in other contexts [72], [74].
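The equations for the dilated recurrent skip connection are not legible in this copy. Assuming the model follows the formulation of dilated recurrent networks in [75], as stated above, the connection can be sketched as follows; the symbols $f$ (the LSTM cell update) and $s^{(l)}$ (the dilation of layer $l$) are assumed notation and do not appear in the surviving text.

```latex
\begin{align}
  \mathrm{LSTM}^{(l)}_{t} &= f\!\left(x^{(l)}_{t},\; \mathrm{LSTM}^{(l)}_{t - s^{(l)}}\right), \\
  s^{(l)} &= M^{\,l-1}, \qquad l = 1, \dots, L .
\end{align}
```

Under this reading, the cell in layer $l$ at time $t$ depends on its state $s^{(l)}$ steps earlier rather than on the immediately preceding step, and the dilation grows exponentially with depth, starting at 1 in the first layer.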

Dilated LSTM with stacked units
Furthermore, we extended the earlier implementation with an attention mechanism inspired by [76], using attention to find the words that are most important to the meaning of a sentence at document level. We use the output of the dilated LSTM as direct input into the attention layer, where $O$ denotes the output of the final layer $L$ of the Dilated LSTM at time $t+1$.
The attention for each word $w$ in a sentence $s$ is computed from three quantities: $u_{it}$ is the hidden representation of the dilated LSTM output, $\alpha_{it}$ represents the normalised attention weights measuring the importance of each word, and $S_i$ is the sentence vector.
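As an illustration of how these three quantities relate, the sketch below uses a common word-level attention formulation: $u_{it} = \tanh(W o_{it} + b)$, $\alpha_{it} = \exp(u_{it}^\top u_w) / \sum_t \exp(u_{it}^\top u_w)$ and $S_i = \sum_t \alpha_{it} o_{it}$. Whether this matches the formulation of [76] term for term is an assumption. The projection `W`, bias `b` and context vector `u_w` are hypothetical learned parameters that do not appear in the text above, and `attention_pool` is an illustrative name.

```python
import math


def attention_pool(outputs, W, b, u_w):
    """Pool a sequence of output vectors into one sentence vector.

    outputs: list of T vectors of size d (stand-ins for dilated LSTM outputs)
    W:       k x d projection matrix (hypothetical learned parameter)
    b:       bias vector of size k (hypothetical learned parameter)
    u_w:     context vector of size k (hypothetical learned parameter)
    """
    scores = []
    for o_t in outputs:
        # Hidden representation u_t = tanh(W o_t + b).
        u_t = [
            math.tanh(sum(w * x for w, x in zip(row, o_t)) + bias)
            for row, bias in zip(W, b)
        ]
        # Unnormalised importance: dot product with the context vector.
        scores.append(sum(u * c for u, c in zip(u_t, u_w)))

    # Softmax over timesteps; subtracting the maximum keeps exp() stable.
    top = max(scores)
    exps = [math.exp(score - top) for score in scores]
    total = sum(exps)
    alphas = [e / total for e in exps]

    # Sentence vector: attention-weighted sum of the output vectors.
    size = len(outputs[0])
    sentence = [sum(a * o[j] for a, o in zip(alphas, outputs)) for j in range(size)]
    return alphas, sentence
```

The returned weights are the values that attention visualisations shade: a word with a larger weight contributes more to the sentence vector.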

EXPERIMENTS
We conduct two different classification experiments, and in both setups we use a Maximum Entropy classifier to establish a performance baseline. We chose this classifier because of its suitability for textual data, where conditional independence of the features cannot be assumed. Additionally, we benchmark our algorithm against the Bidirectional LSTM with attention proposed by [76], as it also utilises attention. Furthermore, we benchmark the Dilated LSTM with ranked units against two other types of RNNs. We use 200-dimensional word embeddings as input to each network, and all neural networks share the same hyperparameters: learning rate = 0.001, batch size = 128 and dropout = 0.5, with the Adam optimiser. Furthermore, we use the full sequence length of each document as input.
For our proposed model, the Dilated LSTM with ranked units, we establish the number of dilations empirically. There are 2 dilated layers with exponentially increasing dilations starting at 1. The number of hidden units is adjusted according to the sequence length used as input to each sub-LSTM, where the number of hidden units is always half of the given sequence length. For example, given a sequence length of 160 and 2 dilations, the input lengths to the sub-LSTMs are [160, 80], whilst the number of hidden units adjusts from 80 to 40. For all other learning models the number of hidden units is set to 300.

For experiment 1 we use the GSN, DL1 and NEU1 datasets, which gives us an overall dataset size of 633 posts. Due to the small size of the dataset we use k-fold cross validation, where k = 10. For experiment 2 we use the GSN, DL2 and NEU2 datasets, where the overall dataset size is 5004 posts, and we split the data into 80% training, 10% validation and 10% test data.
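The rule for sizing the ranked units can be written down in a few lines. This is a sketch under the assumptions that each sub-LSTM sees the sequence length divided by its dilation and that integer division is used for odd lengths; neither detail is stated in the text, and `ranked_schedule` is a hypothetical helper name.

```python
def ranked_schedule(sequence_length, num_layers, base=2):
    """Return dilations, per-layer input lengths and per-layer hidden units.

    Dilations grow exponentially from 1; every sub-LSTM gets half as many
    hidden units as the sequence length it receives.
    """
    dilations = [base ** layer for layer in range(num_layers)]
    input_lengths = [sequence_length // dilation for dilation in dilations]
    hidden_units = [length // 2 for length in input_lengths]
    return dilations, input_lengths, hidden_units


# Reproduces the worked example: sequence length 160 with 2 dilations.
dilations, input_lengths, hidden_units = ranked_schedule(160, 2)
```

For the worked example this gives input lengths [160, 80] and hidden units [80, 40], matching the figures quoted above.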

RESULTS AND EVALUATION
In the following section we outline the results for both experiment 1 and 2 and provide an evaluation.

Experiment 1
All results are shown in Table 9, and we use precision, recall and f1-score as our evaluation metrics. It can be seen that the Dilated LSTM with ranked units and an attention layer outperforms both established benchmarks, by 21.93% (Maximum Entropy) and 4% (BiLSTM with attention) respectively. This is due to its ability to handle sequential data of variable length: as the network's units decrease hierarchically, information is better retained at different timesteps. Of particular interest are the results of the vanilla LSTM, as they are considerably below both the Maximum Entropy classifier's baseline and the next related model, the Bidirectional LSTM. Taking into account earlier observations that LSTMs may struggle to learn sequences above a certain length given a small dataset, we conducted an additional experiment where the sequence length was restricted to 100. In particular, it has been established previously [77], [78] that any vanilla recurrent neural network trained with stochastic gradient descent on a sequence of more than ten time-steps will struggle to learn long-term dependencies. This experiment yielded substantially better results, with an f1-score of 0.66. However, it also meant that over 50% of the documents used in these experiments were cut short and not all of the available information was utilised.

In Figure 5, three confusion matrices are shown, which demonstrate how well the dilated LSTM with ranked units performs compared to the baseline and the best comparable model. It has to be noted that in all three matrices NEU posts are most accurately classified, followed by GSN notes and finally DL notes.

Examples of correctly classified documents are shown in Figures 6, 7 and 8, where words highlighted in darker shades have higher attention weights. One of the main differences between these three types of documents is the usage of personal pronouns: in GSN notes there is frequent usage of 'you', whilst both other document types mainly refer to the first person singular or plural. It can also be seen in Table 8 that personal pronouns have a large effect size for GSN/NEU1 and small effect sizes for GSN/DL1. There is a range of different topics and emotions present in each document. More specifically, emotions such as love, joy and peacefulness are present in GSN notes, whilst in DL blogs anger and hate are predominant. Table 8 also shows that there are small and medium effect sizes for GSN/DL1 comparisons, but fewer effect sizes for GSN/NEU1. This can be seen in NEU1 posts, which use less emotionally intense language when discussing topics and seem to talk about multiple aspects of a topic. Furthermore, the DL1 blog mentions suicidal ideation; however, from a linguistic and sentiment perspective it is clearly distinct from a GSN note.

Experiment 2
The results for experiment 2 can all be seen in Table 10, where we also use precision, recall and f1-score as evaluation metrics. It can be seen in Table 10 that the dilated LSTM with ranked units also outperforms the baselines and comparable learning models by more than 10%. Furthermore, we note that when establishing a baseline using the Maximum Entropy classifier, the f1-score is lower than in experiment 1, which reflects how much harder the task is when using an imbalanced dataset. Using the original sequence length on the LSTM in this experiment also results in improved performance. Overall, it can be seen that all neural network approaches outperform the classification results of the baseline and are considerably higher than the results from experiment 1. Firstly, this could be due to the increased data size, which naturally helps neural networks to perform better, and secondly, it could also be argued that the different learning models find it easier to classify NEU posts due to the imbalance in the dataset.

Figure 9 shows three confusion matrices comparing the best performing model to the baseline and the best competing model. Overall, it can be seen that the baseline model classifies only NEU posts and 1 GSN note correctly, whilst it assumes that most DL notes are NEU posts. When comparing the results of the Bidirectional LSTM with attention to the dilated LSTM with ranked units, it can be seen that the latter is able to correctly classify both GSN and DL notes more often. It could be argued that this is due to the learning model's ability to access the full sequence length.
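Both experiments report precision, recall and f1-score, and the confusion matrices contain the counts from which these metrics are derived. The sketch below shows the standard per-class computation and an unweighted (macro) average. The averaging scheme used for the reported scores is not stated, so macro averaging is an assumption, and the helper names and the example counts in the usage note are hypothetical rather than taken from the paper's tables.

```python
def per_class_metrics(confusion, labels):
    """Compute (precision, recall, f1) per class.

    confusion[i][j] is the number of documents whose true label is labels[i]
    and whose predicted label is labels[j].
    """
    metrics = {}
    for k, label in enumerate(labels):
        true_positives = confusion[k][k]
        predicted = sum(row[k] for row in confusion)  # column sum
        actual = sum(confusion[k])                    # row sum
        precision = true_positives / predicted if predicted else 0.0
        recall = true_positives / actual if actual else 0.0
        denominator = precision + recall
        f1 = 2 * precision * recall / denominator if denominator else 0.0
        metrics[label] = (precision, recall, f1)
    return metrics


def macro_f1(metrics):
    """Unweighted mean of the per-class f1-scores."""
    return sum(f1 for _, _, f1 in metrics.values()) / len(metrics)
```

For example, calling `per_class_metrics([[8, 2, 0], [2, 6, 2], [0, 2, 8]], ["GSN", "DL", "NEU"])` on these made-up counts gives a lower f1-score for the DL class than for the other two. On an imbalanced dataset a macro average weights rare classes such as GSN equally with the majority class, which is one reason the choice of averaging matters for experiment 2.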

Linguistic Evaluation
Figures 10, 11 and 12 all show correctly classified examples from each dataset. In the GSN note (see Figure 10), similar to the findings in the linguistic analysis and the linguistic evaluation in section 6.1.1, personal pronouns ('you'), positive emotions ('love') and an increased focus on the present ('is') seem to be most important for accurate classification. Similarly, in DL2 notes (see Figure 11) references to death ('im dying inside') and work ('unemployment'), as well as negative emotions ('hate'/'angry') and an increased focus on the past ('had'/'used'), are assigned the highest attention weights. However, in NEU2 notes (see Figure 12) there seem to be fewer personal pronouns, an increased use of adjectives and adverbs ('burly', 'beefy' or 'creepy') and fewer references to emotions. These findings also correspond with the small to large Cohen's d effect sizes that were calculated pairwise for each dataset.
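For readers unfamiliar with the measure, the sketch below shows one standard way to compute a pairwise Cohen's d, using the pooled sample standard deviation. Which variant was used for the reported effect sizes is not stated, so the pooling choice is an assumption. The labels follow Cohen's conventional thresholds of 0.2, 0.5 and 0.8, and the function names are hypothetical.

```python
import math
import statistics


def cohens_d(group_a, group_b):
    """Cohen's d between two samples using the pooled standard deviation."""
    n_a, n_b = len(group_a), len(group_b)
    var_a = statistics.variance(group_a)  # sample variance (n - 1 denominator)
    var_b = statistics.variance(group_b)
    pooled_sd = math.sqrt(
        ((n_a - 1) * var_a + (n_b - 1) * var_b) / (n_a + n_b - 2)
    )
    return (statistics.mean(group_a) - statistics.mean(group_b)) / pooled_sd


def effect_label(d):
    """Map an effect size to Cohen's conventional small/medium/large labels."""
    magnitude = abs(d)
    if magnitude >= 0.8:
        return "large"
    if magnitude >= 0.5:
        return "medium"
    if magnitude >= 0.2:
        return "small"
    return "negligible"
```

In this setting each group would hold the per-document score of one LIWC category for one dataset, for instance the personal pronoun score of every GSN note compared against every NEU post.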

CONCLUSION
In this paper we introduced the Dilated LSTM with ranked units and have shown that the learning model is able to successfully distinguish suicide notes from both depressed and blog posts. We found that accurate classification of suicide notes was easier when the dataset was balanced. However, we have also found that when using the dilated LSTM with ranked units on an imbalanced dataset, which makes the overall task more realistic, it was able to identify more suicide notes compared to other learning models. The learning model outperforms the baseline of 60.73% when using F1-score for evaluation in experiment 1 and achieves an F1-score of 96.1% in experiment 2. Furthermore, we have shown that it is possible to achieve better results when significantly reducing the sequence length in a standard LSTM on a small dataset in experiment 1. This demonstrates that accurate classification is possible based solely on linguistic patterns in this type of textual data. Linguistic differences could therefore substantially contribute to future analysis of mental health issues online. Furthermore, we have shown by visualising attention weights which words are most important to each text category. However, additional research is needed to understand if, for example, these language patterns generalise over larger datasets and which role the emotions expressed and topics discussed in textual data could play in further identifying suicidal ideation. Given further research, such work could be useful in a number of scenarios, including but not limited to assessing the seriousness of a social media post or suicide attempt in clinical settings.

Nina Dethlefs is a Lecturer in Computer Science at the University of Hull, UK. Her research interests lie at the intersection of natural language processing and machine learning, particularly in the areas of natural language generation, interactive systems, text mining and social media, as well as domain transfer and adaptability. She holds a PhD in computational linguistics from the University of Bremen, Germany.

Figure 1: Example of a suicide note.

Figure 2: Example of a depressed post.

Figure 3: Example of a blog post.

Figure 5: Three confusion matrices comparing the label classification in three algorithms for experiment 1

Figure 6: Example of a correctly classified 'GSN' note

Figure 7: Example of a correctly classified 'DL' note

Figure 8:

Figure 9: Three confusion matrices comparing the label classification in three algorithms for experiment 2: (a) Maximum Entropy, (b) Bidirectional LSTM with attention, (c) Dilated LSTM with ranked units

Figure 12: Example of a correctly classified 'neutral' blog

Annika Marie Schoene is a PhD candidate in Natural Language Processing at the University of Hull and is affiliated to IBM Research UK. Her research focus is on investigating recurrent neural networks for fine-grained emotion detection in social media data. She also has an interest in mental health issues on social media.

Alexander P. Turner is an Assistant Professor in the Department of Computer Science at the University of Nottingham. His current research interests focus on the application of artificial intelligence techniques in healthcare. He received a PhD in Electronic Engineering from the University of York in 2014. Previously he was a lecturer in the Department of Computer Science at the University of Hull.

Geeth de Mel is a Research Scientist with IBM Research Europe (UK). His research interests are in artificial intelligence, especially in Semantic Web technologies, and in decision support systems in the presence of (or lack of) dynamicity, trust, and provenance. He holds a PhD from the University of Aberdeen, Scotland.

Table 2: LIWC Function and Content Words

Table 4: LIWC Social Processes

The term cognitive processes encompasses a number of different aspects, where it was found that the highest amount of cognitive processes was in DL notes and the lowest in blog posts (see Table 5).

Table 8: Cohen's d effect size

Table 9: Results of Experiment 1 using precision, recall and f1-score

Table 10: Results of Experiment 2 using precision, recall and f1-score