NLP Getting Started Tutorial
```python
import numpy as np  # linear algebra
import pandas as pd  # data processing, CSV file I/O (e.g. pd.read_csv)
from sklearn import feature_extraction, linear_model, model_selection, preprocessing
```
```python
train_df = pd.read_csv('/content/drive/MyDrive/kaggle/tweetDisater/data/train.csv')
test_df = pd.read_csv('/content/drive/MyDrive/kaggle/tweetDisater/data/test.csv')
```
An example of a tweet that is not about a disaster:
```python
train_df[train_df["target"] == 0]["text"].values[1]
```
```
'I love fruits'
```
An example of a tweet that is about a disaster:
```python
train_df[train_df["target"] == 1]["text"].values[1]
```
```
'Forest fire near La Ronge Sask. Canada'
```
- CountVectorizer
- Vectorizes a set of documents by word counts (occurrence frequency), producing a count matrix, also known as a term-document matrix (TDM).
- Because all text is converted to lowercase, "me" and "Me" become the same feature.
```python
count_vectorizer = feature_extraction.text.CountVectorizer()

# count the words in the first five tweets
example_train_vectors = count_vectorizer.fit_transform(train_df["text"][0:5])
```
```python
print(example_train_vectors[0].todense().shape)
print(example_train_vectors[0].todense())
```
```
(1, 54)
[[0 0 0 1 1 1 0 0 0 0 0 0 1 1 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 1 0
  0 0 0 1 0 0 0 0 0 0 0 0 0 1 1 0 1 0]]
```
- The first 5 tweets contain 54 unique words (tokens).
- The first tweet contains only some of those 54 unique words; the non-zero entries mark the tokens that do appear in the first tweet.
Now create vectors for all of the tweets:
```python
train_vectors = count_vectorizer.fit_transform(train_df["text"])
# use transform() (not fit_transform) so the test set is mapped with the training vocabulary
test_vectors = count_vectorizer.transform(test_df["text"])
```
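The `fit_transform` vs. `transform` distinction matters: `transform()` reuses the vocabulary learned from the training set, so train and test matrices have the same columns, and words never seen in training are silently dropped. A small sketch with made-up sentences:

```python
from sklearn.feature_extraction.text import CountVectorizer

vec = CountVectorizer()
train_X = vec.fit_transform(["the river flooded the town"])
# transform() reuses the training vocabulary; the unseen words
# "earthquake" and "hit" are dropped from the test vector
test_X = vec.transform(["the earthquake hit the town"])

print(train_X.shape[1] == test_X.shape[1])  # True: same number of columns
```

Had we called `fit_transform` on the test set instead, the two matrices would describe different vocabularies and the classifier's learned weights would no longer line up with the test features.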
```python
clf = linear_model.RidgeClassifier()
```
```python
scores = model_selection.cross_val_score(clf, train_vectors, train_df["target"], cv=3, scoring="f1")
scores
```
```
array([0.59453669, 0.56498283, 0.64082434])
```
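The `scoring="f1"` option reports the F1 score, the harmonic mean of precision and recall: F1 = 2 · P · R / (P + R). A minimal sketch on made-up labels, checking the hand computation against scikit-learn's `f1_score`:

```python
from sklearn.metrics import f1_score

# Hypothetical true labels and predictions, not from the dataset
y_true = [1, 0, 1, 1, 0, 1]
y_pred = [1, 0, 0, 1, 1, 1]

tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)  # true positives
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)  # false positives
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)  # false negatives

precision = tp / (tp + fp)
recall = tp / (tp + fn)
manual_f1 = 2 * precision * recall / (precision + recall)

assert abs(manual_f1 - f1_score(y_true, y_pred)) < 1e-9
print(manual_f1)  # 0.75
```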
```python
clf.fit(train_vectors, train_df["target"])
```
```
RidgeClassifier(alpha=1.0, class_weight=None, copy_X=True, fit_intercept=True,
                max_iter=None, normalize=False, random_state=None,
                solver='auto', tol=0.001)
```
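For intuition on what `RidgeClassifier` does: it treats the binary labels as regression targets in {-1, +1}, fits a ridge regression, and predicts the class on whichever side of zero the decision score falls. A minimal sketch on a made-up one-feature dataset:

```python
import numpy as np
from sklearn.linear_model import RidgeClassifier

# Toy data: one feature, linearly separable labels
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0, 0, 1, 1])

clf = RidgeClassifier(alpha=1.0)
clf.fit(X, y)

# For binary problems, predict() returns class 1 exactly when the
# decision score is positive
scores = clf.decision_function(X)
preds = clf.predict(X)
assert ((scores > 0).astype(int) == preds).all()
```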
```python
sample_submission = pd.read_csv("/content/drive/MyDrive/kaggle/tweetDisater/data/sample_submission.csv")
sample_submission.head()
```

|   | id | target |
|---|----|--------|
| 0 | 0  | 0 |
| 1 | 2  | 0 |
| 2 | 3  | 0 |
| 3 | 9  | 0 |
| 4 | 11 | 0 |

```python
sample_submission["target"] = clf.predict(test_vectors)
sample_submission.head()
```

|   | id | target |
|---|----|--------|
| 0 | 0  | 0 |
| 1 | 2  | 1 |
| 2 | 3  | 1 |
| 3 | 9  | 0 |
| 4 | 11 | 1 |

```python
sample_submission.to_csv("submission.csv", index=False)
```